Discussion:
[darktable-user] opencl memory issue with GUI vs. commandline
Matthias Bodenbinder
7 years ago
Permalink
Hi,

I have a weird opencl memory issue. I just got a new NVIDIA GFX 1050 Ti and wanted to do some benchmarking. I typically use the benchmarking which has been described several times here in this group as well. I run


darktable-cli bench.SRW test.jpg --core -d perf -d opencl

and see how long pixpelpipe processing takes.

But this time I wanted to this with a bigger RAW. Something that is really stressing the GPU. SO I got RAW_CANON_EOS_5DS.CR2 from http://www.rawsamples.ch/raws/canon/RAW_CANON_EOS_5DS.CR2


darktable-cli RAW_CANON_EOS_5DS.CR2 --core -d perf -d opencl
That works fine. NO memory issue.

Then I open the DT GUI and export a couple of my Canon 6D files. And this also works fine.

But then I go back to the commandline and try again with RAW_CANON_EOS_5DS.CR2 and I get

1,018442 [default_process_tiling_cl_ptp] use tiling on module 'denoiseprofile' for image with full size 8734 x 5856
1,018451 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 8128 x 5856 and overlap 8
1,018453 [default_process_tiling_cl_ptp] tile (0, 0) with 8128 x 5856 at origin [0, 0]
11,549347 [opencl_denoiseprofile] couldn't enqueue kernel! -4
11,553563 [default_process_tiling_opencl_ptp] couldn't run process_cl() for module 'denoiseprofile' in tiling mode: 0
11,553568 [opencl_pixelpipe] could not run module 'denoiseprofile' on gpu. falling back to cpu path


Can it be that the DT GUI is not releasing all the GPU memory and then the commandline fails? And the worst thing is that the system is not recovering from that. Even with bench.SRW which is a lot smaller it is not working anymore. The GPU memory is gone.

3# ll bench.SRW *.CR2
-rw-r--r-- 1 matthias matthias 20453024 17. Jan 2017 bench.SRW
-rw-r--r--+ 1 matthias matthias 67127952 16. Feb 2017 RAW_CANON_EOS_5DS.CR2

Matthias

PS
I have set opencl_memory_headroom=400. But it does not help.
Matthias Bodenbinder
7 years ago
Permalink
Post by Matthias Bodenbinder
3# ll bench.SRW *.CR2
-rw-r--r-- 1 matthias matthias 20453024 17. Jan 2017 bench.SRW
-rw-r--r--+ 1 matthias matthias 67127952 16. Feb 2017 RAW_CANON_EOS_5DS.CR2
bench.SRW can be downloaded here: http://www.mirada.ch/bench.SRW

I use the attached XMP files.

Matthias
Ulrich Pegelow
7 years ago
Permalink
Post by Matthias Bodenbinder
Can it be that the DT GUI is not releasing all the GPU memory and then the commandline fails? And the worst thing is that the system is not recovering from that. Even with bench.SRW which is a lot smaller it is not working anymore. The GPU memory is gone.
You should check on your system with nvidia-smi (typically comes as part
of the nvidia-compute package). The tool will tell you the amount of
used GPU memory. You should be able to see if it differs before and
after running darktable.

You may also try 'darktable -d opencl -d memory' which gives you a log
of darktable's opencl memory usage. At the end it should go down back to
zero and also tell you the peak memory usage.

Ulrich
Matthias Bodenbinder
7 years ago
Permalink
...
Ok. I did the testing again with -d memory and with nvidia-smi in parallel.

I am repeating the following cycle:

1st: on the commadline darktable-cli with the big Canon 5Ds RAW file
and measuring with nvidia-smi
2nd: in darktable GUI export of 26 canon 6D raw files
and measuring with nvidia-smi
3rd: back to 1st

When I repeat this cycle I see with nvidia-smi that the GPU memory is not fully released.

When I start it is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24 Driver Version: 396.24 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| 20% 29C P0 N/A / 75W | 205MiB / 4032MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3161 G /usr/lib/xorg-server/Xorg 119MiB |
| 0 3204 G /usr/bin/gnome-shell 55MiB |
| 0 3595 G /usr/lib/thunderbird/thunderbird 27MiB |
+-----------------------------------------------------------------------------+

after about 3-4 cycles it is:
Thu May 24 17:36:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24 Driver Version: 396.24 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| 21% 47C P0 N/A / 75W | 365MiB / 4032MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3161 G /usr/lib/xorg-server/Xorg 177MiB |
| 0 3204 G /usr/bin/gnome-shell 139MiB |
| 0 3595 G /usr/lib/thunderbird/thunderbird 29MiB |
| 0 12453 G ...-token=A9014C684CDE331E22E41D13D5C24B83 15MiB |
+-----------------------------------------------------------------------------+

And this is where the commandline with the big Canon 5Ds file is getting into memory issues:

[memory] after successful startup
[memory] max address space (vmpeak): 14048600 kB
[memory] cur address space (vmsize): 14048600 kB
[memory] max used memory (vmhwm ): 173900 kB
[memory] cur used memory (vmrss ): 173900 kB
0,577826 [dev] took 0,358 secs (0,294 CPU) to load the image.
0,657177 [export] creating pixelpipe took 0,079 secs (0,165 CPU)
0,657193 [pixelpipe_process] [export] using device 0
[memory] before pixelpipe process
[memory] max address space (vmpeak): 16013512 kB
[memory] cur address space (vmsize): 16013512 kB
[memory] max used memory (vmhwm ): 385344 kB
[memory] cur used memory (vmrss ): 285088 kB
0,657246 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
0,657260 [opencl memory] device 0: 105328640 bytes in use
0,667911 [opencl memory] device 0: 309913856 bytes in use
0,669008 [opencl memory] device 0: 309913872 bytes in use
0,669015 [opencl memory] device 0: 309913888 bytes in use
0,669421 [opencl memory] device 0: 309913872 bytes in use
0,675027 [opencl memory] device 0: 309913856 bytes in use
0,675107 [opencl memory] device 0: 204585216 bytes in use
0,675292 [dev_pixelpipe] took 0,018 secs (0,015 CPU) processed `Raw-Schwarz-/Weißpunkt' on GPU, blended on GPU [export]
0,675299 [opencl memory] device 0: 409170432 bytes in use
0,676389 [opencl memory] device 0: 409170444 bytes in use
0,678972 [opencl memory] device 0: 409170432 bytes in use
0,684469 [opencl memory] device 0: 204585216 bytes in use
0,684793 [dev_pixelpipe] took 0,009 secs (0,005 CPU) processed `Weißabgleich' on GPU, blended on GPU [export]
0,684800 [opencl memory] device 0: 409170432 bytes in use
0,693756 [opencl memory] device 0: 204585216 bytes in use
0,694007 [dev_pixelpipe] took 0,009 secs (0,008 CPU) processed `Spitzlicht-Rekonstruktion' on GPU, blended on GPU [export]
0,694014 [opencl memory] device 0: 1022926080 bytes in use
0,695104 [opencl memory] device 0: 1841266944 bytes in use
0,700809 [opencl memory] device 0: 1022926080 bytes in use
0,751384 [opencl memory] device 0: 818340864 bytes in use
0,755737 [dev_pixelpipe] took 0,062 secs (0,018 CPU) processed `Entrastern' on GPU, blended on GPU [export]
1,016704 [opencl memory] device 0: 0 bytes in use
1,018748 [default_process_tiling_cl_ptp] use tiling on module 'denoiseprofile' for image with full size 8734 x 5856
1,018754 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 8128 x 5856 and overlap 8
1,018756 [default_process_tiling_cl_ptp] tile (0, 0) with 8128 x 5856 at origin [0, 0]
1,018760 [opencl memory] device 0: 761561088 bytes in use
1,018762 [opencl memory] device 0: 1523122176 bytes in use
1,113819 [opencl memory] device 0: 2284683264 bytes in use
1,113829 [opencl memory] device 0: 3046244352 bytes in use
1,113831 [opencl memory] device 0: 3236634624 bytes in use
1,113837 [opencl memory] device 0: 3427024896 bytes in use
1,113839 [opencl memory] device 0: 3617415168 bytes in use
1,113841 [opencl memory] device 0: 3807805440 bytes in use
11,528715 [opencl memory] device 0: 3617415168 bytes in use
11,528904 [opencl memory] device 0: 3427024896 bytes in use
11,531146 [opencl memory] device 0: 3236634624 bytes in use
11,533251 [opencl memory] device 0: 3046244352 bytes in use
11,537523 [opencl memory] device 0: 2284683264 bytes in use
11,542206 [opencl memory] device 0: 1523122176 bytes in use
11,546235 [opencl_denoiseprofile] couldn't enqueue kernel! -4
11,546240 [opencl memory] device 0: 761561088 bytes in use
11,550443 [opencl memory] device 0: 0 bytes in use
11,550447 [default_process_tiling_opencl_ptp] couldn't run process_cl() for module 'denoiseprofile' in tiling mode: 0
11,550450 [opencl_pixelpipe] could not run module 'denoiseprofile' on gpu. falling back to cpu path
36,835311 [dev_pixelpipe] took 36,080 secs (206,513 CPU) processed `Entrauschen (Profil)' on CPU, blended on CPU [export]
36,835361 [opencl memory] device 0: 818340864 bytes in use
36,926261 [opencl memory] device 0: 1636681728 bytes in use
36,927550 [opencl memory] device 0: 1636943872 bytes in use
36,927559 [opencl memory] device 0: 1636943884 bytes in use
36,928372 [opencl memory] device 0: 1636681740 bytes in use
36,962240 [opencl memory] device 0: 1636681728 bytes in use
36,962379 [opencl memory] device 0: 818340864 bytes in use
36,963098 [dev_pixelpipe] took 0,128 secs (0,183 CPU) processed `Basiskurve' on GPU, blended on GPU [export]
36,963108 [opencl memory] device 0: 1636681728 bytes in use
36,964171 [opencl memory] device 0: 1636681764 bytes in use
36,964177 [opencl memory] device 0: 1636681800 bytes in use
36,964276 [opencl memory] device 0: 1636943944 bytes in use
36,964372 [opencl memory] device 0: 1637206088 bytes in use
36,964468 [opencl memory] device 0: 1637468232 bytes in use
36,964471 [opencl memory] device 0: 1637468268 bytes in use
36,978946 [opencl memory] device 0: 1637468232 bytes in use
37,010210 [opencl memory] device 0: 1637468196 bytes in use
37,010222 [opencl memory] device 0: 1637206052 bytes in use
37,010228 [opencl memory] device 0: 1636943908 bytes in use
37,010232 [opencl memory] device 0: 1636681764 bytes in use
37,010575 [opencl memory] device 0: 1636681728 bytes in use
37,010693 [opencl memory] device 0: 818340864 bytes in use
37,011223 [dev_pixelpipe] took 0,048 secs (0,034 CPU) processed `Eingabefarbprofil' on GPU, blended on GPU [export]
37,011230 [opencl memory] device 0: 1636681728 bytes in use
37,012426 [opencl memory] device 0: 2455022592 bytes in use
37,012433 [opencl memory] device 0: 2455022636 bytes in use
37,022835 [opencl memory] device 0: 2455022592 bytes in use
37,120083 [opencl memory] device 0: 1636681728 bytes in use
37,120613 [opencl memory] device 0: 818340864 bytes in use
37,125276 [dev_pixelpipe] took 0,114 secs (0,076 CPU) processed `Schärfen' on GPU, blended on GPU [export]
37,125284 [opencl memory] device 0: 1636681728 bytes in use
37,126477 [opencl memory] device 0: 1636681764 bytes in use
37,126560 [opencl memory] device 0: 1636943908 bytes in use
37,126630 [opencl memory] device 0: 1637206052 bytes in use
37,126702 [opencl memory] device 0: 1637468196 bytes in use
37,126705 [opencl memory] device 0: 1637468232 bytes in use
37,141732 [opencl memory] device 0: 1637468196 bytes in use
37,202925 [opencl memory] device 0: 1637206052 bytes in use
37,202937 [opencl memory] device 0: 1636943908 bytes in use
37,202942 [opencl memory] device 0: 1636681764 bytes in use
37,203554 [opencl memory] device 0: 1636681728 bytes in use
37,203667 [opencl memory] device 0: 818340864 bytes in use
37,204190 [dev_pixelpipe] took 0,079 secs (0,044 CPU) processed `Ausgabefarbprofil' on GPU, blended on GPU [export]
37,269189 [opencl memory] device 0: 0 bytes in use
37,309177 [dev_pixelpipe] took 0,105 secs (0,363 CPU) processed `Gamma' on CPU, blended on CPU [export]
37,309216 [opencl_profiling] profiling device 0 ('GeForce GTX 1050 Ti'):
37,309220 [opencl_profiling] spent 0,1814 seconds in [Write Image (from host to device)]
37,309223 [opencl_profiling] spent 0,0056 seconds in rawprepare_1f
37,309226 [opencl_profiling] spent 0,0054 seconds in whitebalance_1f
37,309228 [opencl_profiling] spent 0,0058 seconds in highlights_1f_clip
37,309231 [opencl_profiling] spent 0,0254 seconds in ppg_demosaic_green
37,309234 [opencl_profiling] spent 0,0244 seconds in ppg_demosaic_redblue
37,309236 [opencl_profiling] spent 0,0017 seconds in border_interpolate
37,309240 [opencl_profiling] spent 0,3254 seconds in [Read Image (from device to host)]
37,309242 [opencl_profiling] spent 0,0219 seconds in denoiseprofile_precondition
37,309245 [opencl_profiling] spent 0,0076 seconds in denoiseprofile_init
37,309247 [opencl_profiling] spent 2,0810 seconds in denoiseprofile_dist
37,309250 [opencl_profiling] spent 0,8712 seconds in denoiseprofile_horiz
37,309252 [opencl_profiling] spent 2,1148 seconds in denoiseprofile_vert
37,309254 [opencl_profiling] spent 5,1772 seconds in denoiseprofile_accu
37,309257 [opencl_profiling] spent 0,0337 seconds in basecurve_lut
37,309259 [opencl_profiling] spent 0,0312 seconds in colorin_unbound
37,309262 [opencl_profiling] spent 0,0357 seconds in sharpen_hblur
37,309264 [opencl_profiling] spent 0,0207 seconds in sharpen_vblur
37,309267 [opencl_profiling] spent 0,0447 seconds in sharpen_mix
37,309269 [opencl_profiling] spent 0,0612 seconds in colorout
37,309271 [opencl_profiling] spent 11,0762 seconds totally in command queue (with 1 event missing)
37,309278 [dev_process_export] pixel pipeline processing took 36,652 secs (207,261 CPU)
[export_job] exported to `test-1.jpg'
37,787315 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti' (0): peak memory usage 3807805440 bytes
37,787326 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti' (0): 499 out of 500 events were successful and 1 events lost




####
# with the first tries I did not have the memory issue and it looked like this:

[memory] after successful startup
[memory] max address space (vmpeak): 14048600 kB
[memory] cur address space (vmsize): 14048600 kB
[memory] max used memory (vmhwm ): 174320 kB
[memory] cur used memory (vmrss ): 174320 kB
0,569022 [dev] took 0,352 secs (0,285 CPU) to load the image.
0,651794 [export] creating pixelpipe took 0,082 secs (0,187 CPU)
0,651810 [pixelpipe_process] [export] using device 0
[memory] before pixelpipe process
[memory] max address space (vmpeak): 16013444 kB
[memory] cur address space (vmsize): 16013444 kB
[memory] max used memory (vmhwm ): 385628 kB
[memory] cur used memory (vmrss ): 285376 kB
0,651859 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
0,651872 [opencl memory] device 0: 105328640 bytes in use
0,662504 [opencl memory] device 0: 309913856 bytes in use
0,663567 [opencl memory] device 0: 309913872 bytes in use
0,663570 [opencl memory] device 0: 309913888 bytes in use
0,663956 [opencl memory] device 0: 309913872 bytes in use
0,668663 [opencl memory] device 0: 309913856 bytes in use
0,668743 [opencl memory] device 0: 204585216 bytes in use
0,668907 [dev_pixelpipe] took 0,017 secs (0,010 CPU) processed `Raw-Schwarz-/Weißpunkt' on GPU, blended on GPU [export]
0,668913 [opencl memory] device 0: 409170432 bytes in use
0,669972 [opencl memory] device 0: 409170444 bytes in use
0,671924 [opencl memory] device 0: 409170432 bytes in use
0,677295 [opencl memory] device 0: 204585216 bytes in use
0,677508 [dev_pixelpipe] took 0,009 secs (0,001 CPU) processed `Weißabgleich' on GPU, blended on GPU [export]
0,677515 [opencl memory] device 0: 409170432 bytes in use
0,687658 [opencl memory] device 0: 204585216 bytes in use
0,687909 [dev_pixelpipe] took 0,010 secs (0,003 CPU) processed `Spitzlicht-Rekonstruktion' on GPU, blended on GPU [export]
0,687915 [opencl memory] device 0: 1022926080 bytes in use
0,689002 [opencl memory] device 0: 1841266944 bytes in use
0,693816 [opencl memory] device 0: 1022926080 bytes in use
0,738143 [opencl memory] device 0: 818340864 bytes in use
0,742500 [dev_pixelpipe] took 0,055 secs (0,030 CPU) processed `Entrastern' on GPU, blended on GPU [export]
1,003227 [opencl memory] device 0: 0 bytes in use
1,005273 [default_process_tiling_cl_ptp] use tiling on module 'denoiseprofile' for image with full size 8734 x 5856
1,005279 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 8128 x 5856 and overlap 8
1,005281 [default_process_tiling_cl_ptp] tile (0, 0) with 8128 x 5856 at origin [0, 0]
1,005286 [opencl memory] device 0: 761561088 bytes in use
1,005288 [opencl memory] device 0: 1523122176 bytes in use
1,099501 [opencl memory] device 0: 2284683264 bytes in use
1,099511 [opencl memory] device 0: 3046244352 bytes in use
1,099513 [opencl memory] device 0: 3236634624 bytes in use
1,099519 [opencl memory] device 0: 3427024896 bytes in use
1,099521 [opencl memory] device 0: 3617415168 bytes in use
1,099522 [opencl memory] device 0: 3807805440 bytes in use
11,516149 [opencl memory] device 0: 3617415168 bytes in use
11,548085 [opencl memory] device 0: 3427024896 bytes in use
11,550326 [opencl memory] device 0: 3236634624 bytes in use
11,552427 [opencl memory] device 0: 3046244352 bytes in use
11,554597 [opencl memory] device 0: 2284683264 bytes in use
11,559091 [opencl memory] device 0: 1523122176 bytes in use
11,810621 [opencl memory] device 0: 761561088 bytes in use
11,811683 [opencl memory] device 0: 0 bytes in use
11,816696 [default_process_tiling_cl_ptp] tile (1, 0) with 622 x 5856 at origin [8112, 0]
11,816703 [opencl memory] device 0: 58278912 bytes in use
11,816706 [opencl memory] device 0: 116557824 bytes in use
11,833042 [opencl memory] device 0: 174836736 bytes in use
11,833051 [opencl memory] device 0: 233115648 bytes in use
11,833053 [opencl memory] device 0: 247685376 bytes in use
11,833055 [opencl memory] device 0: 262255104 bytes in use
11,833057 [opencl memory] device 0: 276824832 bytes in use
11,833059 [opencl memory] device 0: 291394560 bytes in use
12,569053 [opencl memory] device 0: 276824832 bytes in use
12,570918 [opencl memory] device 0: 262255104 bytes in use
12,571352 [opencl memory] device 0: 247685376 bytes in use
12,571788 [opencl memory] device 0: 233115648 bytes in use
12,572220 [opencl memory] device 0: 174836736 bytes in use
12,572684 [opencl memory] device 0: 116557824 bytes in use
12,586578 [opencl memory] device 0: 58278912 bytes in use
12,586720 [opencl memory] device 0: 0 bytes in use
12,588392 [dev_pixelpipe] took 11,846 secs (7,755 CPU) processed `Entrauschen (Profil)' on GPU with tiling, blended on CPU [export]
12,588401 [opencl memory] device 0: 818340864 bytes in use
12,672145 [opencl memory] device 0: 1636681728 bytes in use
12,673425 [opencl memory] device 0: 1636943872 bytes in use
12,673433 [opencl memory] device 0: 1636943884 bytes in use
12,674257 [opencl memory] device 0: 1636681740 bytes in use
12,706723 [opencl memory] device 0: 1636681728 bytes in use
12,706837 [opencl memory] device 0: 818340864 bytes in use
12,707443 [dev_pixelpipe] took 0,119 secs (0,104 CPU) processed `Basiskurve' on GPU, blended on GPU [export]
12,707452 [opencl memory] device 0: 1636681728 bytes in use
12,708645 [opencl memory] device 0: 1636681764 bytes in use
12,708652 [opencl memory] device 0: 1636681800 bytes in use
12,708748 [opencl memory] device 0: 1636943944 bytes in use
12,708840 [opencl memory] device 0: 1637206088 bytes in use
12,708932 [opencl memory] device 0: 1637468232 bytes in use
12,708935 [opencl memory] device 0: 1637468268 bytes in use
12,723289 [opencl memory] device 0: 1637468232 bytes in use
12,752757 [opencl memory] device 0: 1637468196 bytes in use
12,752768 [opencl memory] device 0: 1637206052 bytes in use
12,752773 [opencl memory] device 0: 1636943908 bytes in use
12,752778 [opencl memory] device 0: 1636681764 bytes in use
12,753118 [opencl memory] device 0: 1636681728 bytes in use
12,753225 [opencl memory] device 0: 818340864 bytes in use
12,753745 [dev_pixelpipe] took 0,046 secs (0,022 CPU) processed `Eingabefarbprofil' on GPU, blended on GPU [export]
12,753752 [opencl memory] device 0: 1636681728 bytes in use
12,754946 [opencl memory] device 0: 2455022592 bytes in use
12,754953 [opencl memory] device 0: 2455022636 bytes in use
12,765358 [opencl memory] device 0: 2455022592 bytes in use
12,860049 [opencl memory] device 0: 1636681728 bytes in use
12,860570 [opencl memory] device 0: 818340864 bytes in use
12,865228 [dev_pixelpipe] took 0,111 secs (0,073 CPU) processed `Schärfen' on GPU, blended on GPU [export]
12,865235 [opencl memory] device 0: 1636681728 bytes in use
12,866429 [opencl memory] device 0: 1636681764 bytes in use
12,866504 [opencl memory] device 0: 1636943908 bytes in use
12,866569 [opencl memory] device 0: 1637206052 bytes in use
12,866637 [opencl memory] device 0: 1637468196 bytes in use
12,866640 [opencl memory] device 0: 1637468232 bytes in use
12,881695 [opencl memory] device 0: 1637468196 bytes in use
12,941271 [opencl memory] device 0: 1637206052 bytes in use
12,941280 [opencl memory] device 0: 1636943908 bytes in use
12,941285 [opencl memory] device 0: 1636681764 bytes in use
12,941892 [opencl memory] device 0: 1636681728 bytes in use
12,942000 [opencl memory] device 0: 818340864 bytes in use
12,942517 [dev_pixelpipe] took 0,077 secs (0,045 CPU) processed `Ausgabefarbprofil' on GPU, blended on GPU [export]
13,007575 [opencl memory] device 0: 0 bytes in use
13,043600 [dev_pixelpipe] took 0,101 secs (0,335 CPU) processed `Gamma' on CPU, blended on CPU [export]
13,043661 [opencl_profiling] profiling device 0 ('GeForce GTX 1050 Ti'):
13,043665 [opencl_profiling] spent 0,1912 seconds in [Write Image (from host to device)]
13,043668 [opencl_profiling] spent 0,0047 seconds in rawprepare_1f
13,043671 [opencl_profiling] spent 0,0053 seconds in whitebalance_1f
13,043674 [opencl_profiling] spent 0,0049 seconds in highlights_1f_clip
13,043676 [opencl_profiling] spent 0,0195 seconds in ppg_demosaic_green
13,043679 [opencl_profiling] spent 0,0233 seconds in ppg_demosaic_redblue
13,043681 [opencl_profiling] spent 0,0017 seconds in border_interpolate
13,043684 [opencl_profiling] spent 0,5840 seconds in [Read Image (from device to host)]
13,043687 [opencl_profiling] spent 0,0220 seconds in denoiseprofile_precondition
13,043689 [opencl_profiling] spent 0,0095 seconds in denoiseprofile_init
13,043692 [opencl_profiling] spent 2,1725 seconds in denoiseprofile_dist
13,043694 [opencl_profiling] spent 0,9731 seconds in denoiseprofile_horiz
13,043697 [opencl_profiling] spent 2,2716 seconds in denoiseprofile_vert
13,043699 [opencl_profiling] spent 5,4051 seconds in denoiseprofile_accu
13,043702 [opencl_profiling] spent 0,0336 seconds in denoiseprofile_finish
13,043704 [opencl_profiling] spent 0,0324 seconds in basecurve_lut
13,043706 [opencl_profiling] spent 0,0294 seconds in colorin_unbound
13,043709 [opencl_profiling] spent 0,0342 seconds in sharpen_hblur
13,043711 [opencl_profiling] spent 0,0195 seconds in sharpen_vblur
13,043713 [opencl_profiling] spent 0,0449 seconds in sharpen_mix
13,043716 [opencl_profiling] spent 0,0595 seconds in colorout
13,043718 [opencl_profiling] spent 11,9418 seconds totally in command queue (with 0 events missing)
13,043726 [dev_process_export] pixel pipeline processing took 12,392 secs (8,381 CPU)
Matthias Bodenbinder
7 years ago
Permalink
...
So I tried the smaller bench.SRW on the console and that is running fine without memory issue which nvidia-smi is showing this status before the try:

12# nvidia-smi
Thu May 24 18:13:32 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24 Driver Version: 396.24 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| 20% 47C P0 N/A / 75W | 357MiB / 4032MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3161 G /usr/lib/xorg-server/Xorg 140MiB |
| 0 3204 G /usr/bin/gnome-shell 214MiB |
+-----------------------------------------------------------------------------+

The bigger Canon 5Ds file is not working without memory issue anymore. I need to reboot to get it back.

Matthias
Ulrich Pegelow
7 years ago
Permalink
Post by Matthias Bodenbinder
37,787315 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti' (0): peak memory usage 3807805440 bytes
37,787326 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti' (0): 499 out of 500 events were successful and 1 events lost
Peak memory usage is at 3632MB which is 400MB below your GPUs total of
4032MB, as to be expected due to your setting of
opencl_memory_headroom=400.

Intermediately you have

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24 Driver Version: 396.24
|
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile
Uncorr. ECC |
Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util
Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On |
N/A |
| 21% 47C P0 N/A / 75W | 365MiB / 4032MiB | 0%
Default |
+-------------------------------+----------------------+----------------------+

which is really close to 400MB. Note that darktable (as any program
using opencl) has no means to detect the free amount of GPU memory at
any time, so it relies on having access to all minus opencl_memory_headroom.

I suggest to set a significantly higher value for opencl_memory_headroom
like 800 or even higher in your case. You have a lot of memory, no
problem to be a bit more permissive.

Ulrich
Matthias Bodenbinder
7 years ago
Permalink
...
I did set opencl_memory_headroom=700 now.

But for me the issue is more: Why does it work with opencl_memory_headroom=400 for the first couple of cycles and then I get the issue. Then I increase the value to opencl_memory_headroom=500 which fixes it for a while before it happens again and I need to go to opencl_memory_headroom=600 and so forth. That means that the memory problem is getting worse over time. But why? Is that a memory leak?
Ulrich Pegelow
7 years ago
Permalink
Post by Matthias Bodenbinder
But for me the issue is more: Why does it work with opencl_memory_headroom=400 for the first couple of cycles and then I get the issue. Then I increase the value to opencl_memory_headroom=500 which fixes it for a while before it happens again and I need to go to opencl_memory_headroom=600 and so forth. That means that the memory problem is getting worse over time. But why? Is that a memory leak?
Well, never say never. But least here I cannot detect a leak. I'll do
some further tests with a 5Ds raw. Is it one of the two XMPs that you
have attached which shows the problem most reproducibly?

Ulrich
Matthias Bodenbinder
7 years ago
Permalink
...
Hello Ulrich

I very much appreciate that you are looking into this. I just verfified this morning that the issue is reproducible and I have a more automated way of reproducing it.

I have put all relevant files on dropbox: https://www.dropbox.com/sh/yarghqgncirjd0w/AADUNyFGaGUpyBTDil_qaxsUa?dl=0

There you will find a script test-CR2.sh which infinitly calls darktable-cli with the big 5Ds file. And it creates a log file for this. I have attached 2 of those log files.

The first log file 5Ds-infite-30762-15-runs-no-issue.log.gz is from the start of the test, where darktable-cli is runnming just fine 15 times. Then I enter the darktable GUI and export 26 of my Canon 6D files. I exit the GUI and start test-CR2.sh again. I had to repeat this cycle for 4 times before the issue occurs. The result is shown in the second logfile: 5Ds-infite-2750-3-runs-with-issue.log.gz

I have been using opencl_memory_headroom=500 for this test. I am using DT 2.4.3. In addition I have firefox 60.0.1, thunderbird 52.8 and three gnome-terminals open. My system is running on Manjaro testing with kernel 4.16.11.

Matthias
Ulrich Pegelow
7 years ago
Permalink
I'll have a deeper look into this. In the meantime it would be helpful
to learn from you in more detail what nvidia-smi tells you about the GPU
memory usage in the different steps of your test, especially before and
after you get your test run failing for the first time.

For your information here on my system it typically looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111
|
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile
Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util
Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:01:00.0 On |
N/A |
| 3% 50C P8 10W / 120W | 173MiB / 6065MiB | 0%
Default |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| Processes: GPU
Memory |
| GPU PID Type Process name Usage
|
|=============================================================================|
| 0 3563 G /usr/bin/X
93MiB |
| 0 3977 G kwin_x11
19MiB |
| 0 3992 G /usr/bin/krunner
1MiB |
| 0 3994 G /usr/bin/plasmashell
53MiB |
| 0 4035 G /usr/bin/kgpg
2MiB |
+-----------------------------------------------------------------------------+

When darkttable GUI is running and idle it has an additional permanent
requirement of about 63MB. During image processing GPU memory of
darktable goes steeply up but always returns to the idle value
afterwards. When terminating darktable there should be no traces of it
left in the output of nvidia-smi.

Ulrich
...
Matthias Bodenbinder
7 years ago
Permalink
Hello Ulrich,

I did more testing around starting / stopping DT. This alone increase the GPU memory consumption continuously by 30 MiB per start/stop activity. It starts with 107 MiB at the beginning and ends at 271 MiB after five start/stop cylces. This is with gnome-shell.

I did the same test with Plasma5 and it does happen there. Looks like a memory leak in gnome-shell. I know that gnome-shell had a memory leak but that it is also affecting GPU memory is new to me.

And I also thought that gnome-shell 3.28.2-1 fixed the meory leak. But obviously it doesnt.

I guess I need to open a bug report for gnome-shell.

But anyways, thanks for your help so far.

Matthias
Peter McD
7 years ago
Permalink
...
Post by Ulrich Pegelow
Post by Matthias Bodenbinder
37,787315 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti'
(0): peak memory usage 3807805440 bytes
37,787326 [opencl_summary_statistics] device 'GeForce GTX 1050 Ti'
(0): 499 out of 500 events were successful and 1 events lost
Peak memory usage is at 3632MB which is 400MB below your GPUs total of
4032MB, as to be expected due to your setting of
opencl_memory_headroom=400.
...

Is there a rule of thumb for headroom settings?
I have an AMD RX 460 with 2GB memory.

cu
Peter
Bernhard
7 years ago
Permalink
Post by Peter McD
...
Is there a rule of thumb for headroom settings?
I have an AMD RX 460 with 2GB memory.
AMD is different from NVidia
https://www.darktable.org/usermanual/en/darktable_and_opencl_amd.html
--
regards
Bernhard

https://www.bilddateien.de
Matthias Bodenbinder
7 years ago
Permalink
Post by Peter McD
Is there a rule of thumb for headroom settings?
Very good question.

Specifically because I do not see a performance difference between
opencl_memory_headroom=1000 and opencl_memory_headroom=400. The pipeline processing always takes the same time (with a 3 % variance). So why shoulkd I bother with small values of opencl_memory_headroom?

Matthias
Ulrich Pegelow
7 years ago
Permalink
Post by Matthias Bodenbinder
Post by Peter McD
Is there a rule of thumb for headroom settings?
Very good question.
Specifically because I do not see a performance difference between
opencl_memory_headroom=1000 and opencl_memory_headroom=400. The pipeline processing always takes the same time (with a 3 % variance). So why shoulkd I bother with small values of opencl_memory_headroom?
In fact that depends a lot on your total amount of GPU memory.

When darktable's opencl support was implemented, systems with 1GB were
the norm. Typically only about 700MB of that would be available to
darktable, the rest needed to be left alone for system purposes (which
we found out by trial and error).

There were two corner cases to be taken into account:

* a too small value of opencl_memory_headroom would lead to out of
memory situations in darktable, processing would fall back to the CPU
which tends to be much slower.

* a too high value of opencl_memory_headroom would force darktable to go
into tile-wise processing much too often. This also costs performance.

As a reasonable compromise we now have a setting of 400 as default. That
should reserve enough space for the system and still prevent tiling in
most cases.

If your system has lots of GPU memory the opencl_memory_headroom setting
can be much more relaxed = higher. When doing some tests in the course
of this thread on my computer I reached peak memory usage of about 4GB
during export of 5Ds images without any tiling (I have 6GB total GPU
memory). In my case I may therefore set opencl_memory_headroom to 2000
without getting any performance difference.

Even on a 4GB system a setting of 1000 might be totally fine. Tiling
will happen a bit more frequently, but in the end this will only affect
few modules with high memory demand (e.g. profiled denoise) and large
images. On a system with 2GB or less the situation will probably look a
bit different, though.

Ulrich
Peter McD
7 years ago
Permalink
...
OK, I'll set headroom to 400 and see what will happen with my AMD
RX-460, 2GB memory.

Thank's
Peter

Bernhard
7 years ago
Permalink
Post by Ulrich Pegelow
You should check on your system with nvidia-smi (typically comes as
part of the nvidia-compute package). The tool will tell you the amount
of used GPU memory. You should be able to see if it differs before and
after running darktable.
I use this command

|~ $ watch -n 0,1 nvidia-smi|

which reads the values 10 times per second so I can easily see peak
values ...
--
regards
Bernhard

https://www.bilddateien.de
Loading...