โŒ

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Triton-lang: how to handle block sizes

I am trying to use triton-lang to perform a simple element-wise dot product between a column vector and a matrix, both complex-valued. The code works if I don't specify block sizes, but I can't figure out how to cut my grid or how to handle my pointers. I roughly understand the theory of how it should work, but I can't make it work in practice.

import torch
import triton
import triton.language as tl


def cdot(x: torch.Tensor, y: torch.Tensor):
    return x * y

def cdot_triton(x: torch.Tensor, y: torch.Tensor, BLOCK_SIZE):
    # preallocate the output
    z = torch.empty_like(y)

    # check arguments
    assert x.is_cuda and y.is_cuda and z.is_cuda

    # get vector size
    N = z.numel()

    # 1D launch kernel where each block gets its own program
    grid = lambda meta: (N // BLOCK_SIZE, N // BLOCK_SIZE)

    # launch the kernel
    cdot_kernel[grid](x.real, x.imag, y.real, y.imag, z.real, z.imag, N, BLOCK_SIZE)

    return z

@triton.jit
def cdot_kernel(
    x_real_ptr,
    x_imag_ptr,
    y_real_ptr,
    y_imag_ptr,
    z_real_ptr,
    z_imag_ptr,
    N: tl.constexpr,  # Size of the vector
    BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process
):
    row = tl.program_id(0)
    col = tl.arange(0, 2*BLOCK_SIZE)


    if row < BLOCK_SIZE:
        idx = row * BLOCK_SIZE + col
        x_real = tl.load(x_real_ptr + 2*row)
        x_imag = tl.load(x_imag_ptr + 2*row)
        y_real = tl.load(y_real_ptr + 2*idx, mask=col<BLOCK_SIZE, other=0)
        y_imag = tl.load(y_imag_ptr + 2*idx, mask=col<BLOCK_SIZE, other=0)

        z_real = x_real * y_real - x_imag * y_imag
        z_imag = x_real * y_imag + x_imag * y_real

        tl.store(z_real_ptr + 2*idx, z_real, mask=col<BLOCK_SIZE)
        tl.store(z_imag_ptr + 2*idx, z_imag, mask=col<BLOCK_SIZE)
        
# ===========================================
# Test kernel
# ===========================================

size = 4
dtype = torch.complex64
x = torch.rand((size, 1), device='cuda', dtype=dtype)
y = torch.rand((size, size), device='cuda', dtype=dtype)


out_dot = cdot(x,y)
out_kernel = cdot_triton(x,y, BLOCK_SIZE=2)

This is the output:

tensor([[-0.1322+1.1461j, -0.1098+0.8015j,  0.2948+1.2155j, -0.1326+0.6076j],
        [-0.3687+0.4646j,  0.2349+0.5802j,  0.0568+0.9461j, -0.0457+0.3213j],
        [ 0.0523+0.9351j,  0.4409+0.5076j,  0.3956+0.4018j,  0.6230+0.9270j],
        [-0.3503+0.7194j, -0.3742+0.2311j, -0.3353+0.3884j, -0.3478+0.6724j]],
       device='cuda:0')
tensor([[-0.1322+1.1461j, -0.1098+0.8015j,  0.0617+1.0408j, -0.1988+0.4788j],
        [ 0.1147+0.2296j,  0.0686+0.1161j,  0.0647+0.4044j,  0.0795+0.6407j],
        [-0.2396+0.6326j, -0.3587+0.5878j, -0.1563+0.4028j, -0.2933+0.3294j],
        [-0.1214+0.3678j,  0.0440+0.9951j,  0.3342+1.1360j,  0.6796+0.6590j]],
       device='cuda:0')

As you can see, only the first two values of the top row are correct.

Any ideas on how I can make this element-wise dot product work?
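
For what it's worth, here is the kind of 1D-grid version I think I should be aiming for: one program per BLOCK_SIZE chunk of the flattened matrix, a mask for the tail, and the row index recovered with an integer division. This is only a sketch of my current understanding (it assumes the real/imaginary planes are made contiguous on the host first, which my code above does not do), so please correct me if the layout is wrong:

import torch
import triton
import triton.language as tl

@triton.jit
def cdot_kernel_1d(
    x_real_ptr, x_imag_ptr,   # column vector, one value per row
    y_real_ptr, y_imag_ptr,   # matrix, flattened row-major
    z_real_ptr, z_imag_ptr,   # output, flattened row-major
    N,                        # number of columns of y
    n_elements,               # total number of elements of y
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    row = offs // N           # which entry of x this element pairs with

    x_real = tl.load(x_real_ptr + row, mask=mask, other=0.0)
    x_imag = tl.load(x_imag_ptr + row, mask=mask, other=0.0)
    y_real = tl.load(y_real_ptr + offs, mask=mask, other=0.0)
    y_imag = tl.load(y_imag_ptr + offs, mask=mask, other=0.0)

    tl.store(z_real_ptr + offs, x_real * y_real - x_imag * y_imag, mask=mask)
    tl.store(z_imag_ptr + offs, x_real * y_imag + x_imag * y_real, mask=mask)

def cdot_triton_1d(x: torch.Tensor, y: torch.Tensor, BLOCK_SIZE: int = 1024):
    # split into contiguous real/imag planes so the kernel can use unit strides
    xr, xi = x.real.contiguous(), x.imag.contiguous()
    yr, yi = y.real.contiguous(), y.imag.contiguous()
    zr, zi = torch.empty_like(yr), torch.empty_like(yi)

    n_elements = y.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    cdot_kernel_1d[grid](xr, xi, yr, yi, zr, zi, y.shape[1], n_elements,
                         BLOCK_SIZE=BLOCK_SIZE)
    return torch.complex(zr, zi)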

Many thanks!

GPU Programming, CUDA or OpenCL or? [closed]

What is the best way to do GPU programming?

I know:

  • CUDA is very good, has a lot of developer support and is very nice to debug, but it only runs on NVIDIA hardware.
  • OpenCL is very flexible and runs on NVIDIA, AMD and Intel hardware, on accelerators, GPUs and CPUs, but as far as I know it is no longer supported by NVIDIA.
  • Coriander (https://github.com/hughperkins/coriander), which converts CUDA to OpenCL.
  • HIP (https://github.com/ROCm-Developer-Tools/HIP) is made by AMD so that you can write code in a way that targets both AMD and NVIDIA CUDA. It can also convert CUDA to HIP.

OpenCL would be my preferred way, since I want to be very flexible in hardware support. But if it is no longer supported by NVIDIA, that is a knockout. HIP then sounds best to me, with separate release files. But how good will support be for Intel's upcoming hardware?

Are there any other options? Important for me are broad hardware support, long-term support (so that the code can still be compiled in a few years), and manufacturer independence. Additionally, it should work with more than one compiler and be supported on both Linux and Windows.

What is the parameter for CLI YOLOv8 predict to use Intel GPU?

I installed the OpenVINO dependencies and converted the model to OpenVINO format. I have an OpenCL device available:

$ clinfo -l
Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Graphics [0xa7a0]

Trying to run yolo predict on a GPU, e.g.

yolo predict model=openvino_model source='samples/*.jpg' device=gpu
yolo predict model=openvino_model source='samples/*.jpg' device=0

always results in Invalid CUDA device requested. It runs fine on the CPU, though.
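
As an extra data point, a minimal check (assuming the OpenVINO Python package is installed) that OpenVINO itself can see the iGPU would be:

from openvino.runtime import Core

core = Core()
print(core.available_devices)   # I would expect something like ['CPU', 'GPU'] here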

What parameter should I use for yolo to find and use OpenCL device #0?

No GPU EC2 instances associated with AWS Batch

I need to set up GPU-backed instances on AWS Batch.

Here's my .yaml file:

  GPULargeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64:
            Fn::Sub: |
              MIME-Version: 1.0
              Content-Type: multipart/mixed; boundary="==BOUNDARY=="

              --==BOUNDARY==
              Content-Type: text/cloud-config; charset="us-ascii"

              runcmd:
                - yum install -y aws-cfn-bootstrap
                - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
                - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
                - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
                - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
                - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
                - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
                - /usr/bin/docker-storage-setup
                - yum update -y
                - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
                - /etc/init.d/docker restart

              --==BOUNDARY==--
      LaunchTemplateName: GPULargeLaunchTemplate

  GPULargeBatchComputeEnvironment:
    DependsOn:
      - ComputeRole
      - ComputeInstanceProfile
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        ImageId: ami-GPU-optimized-AMI-ID
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        LaunchTemplate:
          LaunchTemplateId:
            Ref: GPULargeLaunchTemplate
          Version:
            Fn::GetAtt:
              - GPULargeLaunchTemplate
              - LatestVersionNumber
        InstanceRole:
          Ref: ComputeInstanceProfile
        InstanceTypes:
          - g4dn.xlarge
        MaxvCpus: 768
        MinvCpus: 1
        SecurityGroupIds:
          - Fn::GetAtt:
              - ComputeSecurityGroup
              - GroupId
        Subnets:
          - Ref: ComputePrivateSubnetA
        Type: EC2
        UpdateToLatestImageVersion: True

  MyGPUBatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment:
            Ref: GPULargeBatchComputeEnvironment
          Order: 1
      Priority: 5
      JobQueueName: MyGPUBatchJobQueue
      State: ENABLED

  MyGPUJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Command:
          - "/opt/bin/python3"
          - "/opt/bin/start.py"
          - "--retry_count"
          - "Ref::batchRetryCount"
          - "--retry_limit"
          - "Ref::batchRetryLimit"
        Environment:
          - Name: "Region"
            Value: "us-west-2"
          - Name: "LANG"
            Value: "en_US.UTF-8"
        Image:
          Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
        JobRoleArn:
          Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
        Memory: 16000
        Vcpus: 1
        ResourceRequirements:
          - Type: GPU
            Value: '1'
      JobDefinitionName: MyGPUJobDefinition
      Timeout:
        AttemptDurationSeconds: 500

When I start a job, it is stuck in the RUNNABLE state forever, so I did the following:

  1. When I swapped the instance type to a normal CPU type, redeployed the CF stack and submitted a job, the job ran and succeeded fine, so something must be missing or wrong with the way I use these GPU instance types on AWS Batch;
  2. Then I found this post, so I added an ImageId field in my ComputeEnvironment with a known GPU-optimized AMI, but still no luck;
  3. I did a side-by-side comparison of the jobs between the working CPU AWS Batch setup and the non-working GPU one by running aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2, and found that the non-working GPU job is missing two fields: containerInstanceArn and taskArn.
  4. I found that the GPU instance is present in the ASG (Auto Scaling Group) created by the Compute Environment, but when I go to ECS and choose this GPU cluster, there are no container instances associated with it, unlike the working CPU setup, where the ECS cluster does have container instances registered (see the sketch after this list).
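
For reference, a minimal boto3 sketch of the checks from steps 3 and 4 (the job ID is the same placeholder as above, and MY_GPU_CLUSTER_NAME is a placeholder for the ECS cluster created by the compute environment):

import boto3

batch = boto3.client("batch", region_name="us-west-2")
ecs = boto3.client("ecs", region_name="us-west-2")

# step 3: the broken GPU job is missing containerInstanceArn and taskArn
job = batch.describe_jobs(jobs=["AWS_BATCH_JOB_EXECUTION_ID"])["jobs"][0]
print(job["container"].get("containerInstanceArn"), job["container"].get("taskArn"))

# step 4: the ECS cluster behind the GPU compute environment shows no registered instances
print(ecs.list_container_instances(cluster="MY_GPU_CLUSTER_NAME")["containerInstanceArns"])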

Any ideas how to fix this would be greatly appreciated!

CUDA 12 + tf-nightly 2.12: "Could not find cuda drivers on your machine, GPU will not be used", although every check passes and it works in torch

  • tf-nightly version = 2.12.0-dev2023203
  • Python version = 3.10.6
  • CUDA drivers version = 525.85.12
  • CUDA version = 12.0
  • Cudnn version = 8.5.0
  • I am using Linux (x86_64, Ubuntu 22.04)
  • I am coding in Visual Studio Code on a venv virtual environment

I am trying to run some models on the GPU (NVIDIA GeForce RTX 3050) using tensorflow nightly 2.12 (to be able to use CUDA 12.0). The problem is that every check I make seems to be correct, but in the end the script is not able to detect the GPU. I've spent a lot of time trying to see what is happening and nothing seems to work, so any advice or solution will be more than welcome. The GPU does work with torch, as you can see at the very end of the question.

I will show some of the most common CUDA checks that I did (in the Visual Studio Code terminal); I hope you find them useful:

  1. Check CUDA version:

    $ nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0
    
  2. Check if the connection with the CUDA libraries is correct:

    $ echo $LD_LIBRARY_PATH

    /usr/cuda/lib
    
  3. Check nvidia drivers for the GPU and check if GPU is readable for the venv:

    $ nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
    | N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
    +-----------------------------------------------------------------------------+
    
  4. Add cuda/bin to PATH and check it:

    $ export PATH="/usr/local/cuda/bin:$PATH"

    $ echo $PATH

    /usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
    
  5. Custom function to check if CUDA is correctly installed: [function by Sherlock]

    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcuda
    check libcudart
    
    libcudart.so.12 -> libcudart.so.12.0.146
            libcuda.so.1 -> libcuda.so.525.85.12
            libcuda.so.1 -> libcuda.so.525.85.12
            libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
    libcuda is installed
            libcudart.so.12 -> libcudart.so.12.0.146
    libcudart is installed
    
  6. Custom function to check if Cudnn is correctly installed: [function by Sherlock]

    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcudnn 
    
            libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.8.0
            libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.8.0
            libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.8.0
            libcudnn.so.8 -> libcudnn.so.8.8.0
            libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.8.0
            libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.8.0
            libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.8.0
    libcudnn is installed
    

So, once I had done these checks, I used a script to evaluate whether everything was finally OK, and then the following error appeared:

import tensorflow as tf

print(f'\nTensorflow version = {tf.__version__}\n')
print(f'\n{tf.config.list_physical_devices("GPU")}\n')
2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]
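
One more check I can add here compares the CUDA/cuDNN versions this wheel was built against with what I have installed (a small sketch; I am assuming tf.sysconfig.get_build_info() is available in this nightly build):

import tensorflow as tf

info = tf.sysconfig.get_build_info()
print("is_cuda_build :", info.get("is_cuda_build"))
print("cuda_version  :", info.get("cuda_version"))
print("cudnn_version :", info.get("cudnn_version"))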

Extra check: I ran a similar check with torch, and there it worked, so I guess the problem is related to tensorflow/tf-nightly.

import torch

print(f'\nAvailable cuda = {torch.cuda.is_available()}')

print(f'\nGPUs availables = {torch.cuda.device_count()}')

print(f'\nCurrent device = {torch.cuda.current_device()}')

print(f'\nCurrent Device location = {torch.cuda.device(0)}')

print(f'\nName of the device = {torch.cuda.get_device_name(0)}')
Available cuda = True

GPUs availables = 1

Current device = 0

Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>

Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU

Please, if you know something that might help solve this issue, don't hesitate to tell me.

How do I parallelize a set of matrix multiplications

Consider the following operation, where I take 20 × 20 slices of a larger matrix and matrix-multiply each of them with a fixed 10 × 20 matrix:

import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

ans_list = []

for i in range(980):
    ans_list.append(
        np.dot(a, b[:, i:i+20])
    )

I know that NumPy parallelizes the actual matrix multiplication, but how do I parallelize the outer for loop so that the individual multiplications are run at the same time instead of sequentially?

Additionally, how would I go about it if I wanted to do the same using a GPU? Obviously, I'll use CuPy instead of NumPy, but how do I submit the multiple matrix multiplications to the GPU either simultaneously or asynchronously?


PS: Please note that the sliding windows above are just an example to generate multiple matmuls. I know that one solution (shown below) in this particular case is to use NumPy's built-in sliding-window functionality, but I'm interested in the optimal way to run an arbitrary set of matmuls in parallel (optionally on a GPU), not just in a faster solution for this particular example.

windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()
ans_list = np.dot(a, windows)
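
For the GPU part, the kind of batched formulation I have in mind looks roughly like this (a sketch assuming CuPy is installed with a matching CUDA toolkit; a single batched matmul replaces the Python-level loop):

import numpy as np
import cupy as cp

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

# build the (981, 20, 20) batch of windows on the CPU, then move everything to the GPU once
windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()
a_gpu = cp.asarray(a)
windows_gpu = cp.asarray(windows)

# broadcasting maps (10, 20) @ (981, 20, 20) -> (981, 10, 20); the whole batch
# is dispatched to the GPU in one asynchronous call
ans_gpu = cp.matmul(a_gpu, windows_gpu)

cp.cuda.Device().synchronize()   # CuPy launches are asynchronous; wait for completion
ans = cp.asnumpy(ans_gpu)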

Display values in GPU memory when debugging with cuda-gdb in VS Code

I have a CUDA program; the code is:

#include<stdio.h>
#include<stdlib.h>
#include<time.h>

#include<cuda_runtime.h>
#include<device_launch_parameters.h>

#define N 671078400

__global__ void v_add(int *a,int *b){
    int i=blockDim.x*blockIdx.x+threadIdx.x;
    if(i<N){
        b[i]=a[i]+1;
    }
    return;
}

int main(){
    

    int *a=NULL,*gpu_a=NULL,*gpu_b=NULL;
    a=(int*)malloc(sizeof(int)*N);

    dim3 grid;
    dim3 block;

    grid.x=655350;
    block.x=1024;

    for(unsigned i=0;i<N;i++){
        a[i]=i;
    }

    cudaMalloc((void **)&gpu_a,sizeof(int)*N);
    cudaMalloc((void **)&gpu_b,sizeof(int)*N);

    cudaMemcpy(gpu_a,a,sizeof(int)*N,cudaMemcpyHostToDevice);
    v_add<<<grid,block>>>(gpu_a,gpu_b);
    cudaDeviceSynchronize();

    cudaMemcpy(a,gpu_a,sizeof(int)*N,cudaMemcpyDeviceToHost);

    free(a);
    cudaFree(gpu_a);

    return 0;
}

This program has no problems. When I debug it using cuda-gdb with the VS Code Nsight extension, with a breakpoint at the line cudaMemcpy(a,gpu_a,sizeof(int)*N,cudaMemcpyDeviceToHost);, I cannot view the values in gpu_a (image 1). However, if I put the breakpoint at the line return; in the kernel __global__ void v_add(int *a,int *b), I can see the right values in gpu_a, which is named a inside the kernel (image 2).

My question is: when I put the breakpoint outside the kernel, as mentioned above at the line cudaMemcpy(a,gpu_a,sizeof(int)*N,cudaMemcpyDeviceToHost);, how can I view values in GPU memory in VS Code with the Nsight extension and cuda-gdb, just like I can when the breakpoint is inside the kernel?

Qt QQuickItem render performance

I am trying to develop a QML app that renders a lot of triangles. Right now I'm trying to render 200,000. My graphics card is an RTX 2060; it should be able to render millions of triangles at 60 fps, yet my application struggles with 200,000 triangles and I get about 10 fps. I know it's GPU bound because my CPU usage is very low and my GPU usage is at 100 percent all the time. What can I do to increase performance?

Here is the code I'm using:

main.cpp

#include <QGuiApplication>
#include <QQmlApplicationEngine>
#include <QQuickWindow>
#include "DenemeClass.h"

int main(int argc, char *argv[])
{
#if QT_VERSION < QT_VERSION_CHECK(6, 0, 0)
    QCoreApplication::setAttribute(Qt::AA_EnableHighDpiScaling);
#endif
    QGuiApplication app(argc, argv);

    qmlRegisterType<DenemeClass>("com.deneme", 1, 0, "DenemeClass");

    QQmlApplicationEngine engine;
    const QUrl url(QStringLiteral("qrc:/main.qml"));
    QObject::connect(&engine, &QQmlApplicationEngine::objectCreated,
        &app, [url](QObject *obj, const QUrl &objUrl) {
            if (!obj && url == objUrl)
                QCoreApplication::exit(-1);
        }, Qt::QueuedConnection);
    engine.load(url);

    return app.exec();
}

DenemeClass.h

#ifndef DENEMECLASS_H
#define DENEMECLASS_H
#include <QQuickItem>

class DenemeClass : public QQuickItem
{
    Q_OBJECT
public:
    DenemeClass();
    QSGNode* updatePaintNode(QSGNode* oldNode, UpdatePaintNodeData* data) override;
    int64_t lastDraw;
};

#endif // DENEMECLASS_H

DenemeClass.cpp

#include "DenemeClass.h"
#include <QSGGeometryNode>
#include <QSGFlatColorMaterial>
#include <QSGGeometry>
#include <iostream>
DenemeClass::DenemeClass()
{
    setFlag(ItemHasContents, true);
}

QSGNode* DenemeClass::updatePaintNode(QSGNode* oldNode, UpdatePaintNodeData* data){

    int64_t draw_t = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now().time_since_epoch()).count();
    std::cout << (1000.0 / double(draw_t - lastDraw)) << "FPS" << std::endl;
    lastDraw = draw_t;

    int rectCount = 100000;

    std::cout << "width " << width() << std::endl;

    QSGGeometryNode* node = reinterpret_cast<QSGGeometryNode*>(oldNode);
    if(!node){
        node = new QSGGeometryNode();
        node->setFlag(QSGNode::OwnsMaterial, true);
        node->setFlag(QSGNode::OwnsGeometry, true);
        QSGFlatColorMaterial* material = new QSGFlatColorMaterial;
        material->setColor("#ff0000");
        node->setMaterial(material);

        QSGGeometry* geometry = new QSGGeometry(QSGGeometry::defaultAttributes_Point2D(), rectCount * 6, 0, QSGGeometry::UnsignedIntType);
        geometry->setDrawingMode(QSGGeometry::DrawTriangles);
        node->setGeometry(geometry);

        QSGGeometry::Point2D* pts = geometry->vertexDataAsPoint2D();

        for(int i = 0; i < rectCount; ++i){
            pts[i * 6 + 0].x = 0;
            pts[i * 6 + 0].y = 0;

            pts[i * 6 + 1].x = 0;
            pts[i * 6 + 1].y = height();

            pts[i * 6 + 2].x = width();
            pts[i * 6 + 2].y = height();

            pts[i * 6 + 3].x = width();
            pts[i * 6 + 3].y = height();

            pts[i * 6 + 4].x = width();
            pts[i * 6 + 4].y = 0;

            pts[i * 6 + 5].x = 0;
            pts[i * 6 + 5].y = 0;
        }
    }

    return node;
}

main.qml

import QtQuick 2.15
import QtQuick.Window 2.15
import com.deneme 1.0
Window {
    width: 640
    height: 480
    visible: true
    title: qsTr("Hello World")

    property double xx: 0
    NumberAnimation on xx{
        running: true
        loops: Animation.Infinite
        from: 0
        to: 100
        duration: 2000
    }

    Rectangle{
        anchors.fill: parent
        color: "yellow"
        anchors.margins: xx
    }

    DenemeClass{
        anchors.fill: parent
        anchors.margins: xx
    }

}

I tried changing the OpenGL backend with QCoreApplication::setAttribute(Qt::AA_UseOpenGLES); but had no luck.

Impossible to convert between the formats supported by the filter '...' - Error reinitializing filters

I am using this ffmpeg command (filter values removed for simplicity)

ffmpeg -hwaccel cuvid -c:v h264_cuvid -y -ss 1 -i "FILE0001.MOV" -ss 0 -i "GOPR0621.MP4" -filter_complex 
[0:v][1:v]
  midequalizer
[al];
[al]
  yadif
  lenscorrection
  scale
[vl];
[1:v]
  lenscorrection
  scale
[vr];
[vl][vr]
  hstack=shortest=1 
-an -c:v h264_nvenc -preset slow "output.mp4"

on a machine with a CUDA-capable graphics card.

I get

ffmpeg version N-90979-g08032331ac Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 7.3.0 (GCC)
  configuration: --enable-gpl --enable-version3 --enable-sdl2 --enable-bzlib --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-libxvid --enable-libaom --enable-libmfx --enable-amf --enable-ffnvcodec --enable-cuvid --enable-d3d11va --enable-nvenc --enable-nvdec --enable-dxva2 --enable-avisynth
  libavutil      56. 18.100 / 56. 18.100
  libavcodec     58. 19.100 / 58. 19.100
  libavformat    58. 13.101 / 58. 13.101
  libavdevice    58.  4.100 / 58.  4.100
  libavfilter     7. 21.100 /  7. 21.100
  libswscale      5.  2.100 /  5.  2.100
  libswresample   3.  2.100 /  3.  2.100
  libpostproc    55.  2.100 / 55.  2.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 00000254a8afc0c0] st: 0 edit list: 1 Missing key frame while searching for timestamp: 6006
[mov,mp4,m4a,3gp,3g2,mj2 @ 00000254a8afc0c0] st: 0 edit list 1 Cannot find an index entry before timestamp: 6006.
....
Stream mapping:
  Stream #0:0 (h264_cuvid) -> midequalizer:in0
  Stream #1:0 (h264) -> midequalizer:in1
  Stream #1:0 (h264) -> lenscorrection
  hstack -> Stream #0:0 (h264_nvenc)
  
Impossible to convert between the formats supported by the filter 'graph 0 input from stream 0:0' and the filter 'auto_scaler_0'
Error reinitializing filters!

The same command without CUDA works, i.e.:

ffmpeg -y -ss 1 -i "FILE0001.MOV" -ss 0 -i "GOPR0621.MP4" -filter_complex 
[0:v][1:v]
  midequalizer
[al];
[al]
  yadif
  lenscorrection
  scale
[vl];
[1:v]
  lenscorrection
  scale
[vr];
[vl][vr]
  hstack=shortest=1 
-an "output.mp4"

How do I make it work on a Windows 10 machine with CUDA?

How to keep a WGPU RenderPipeline alive after storing it in a void pointer in C++?

I am trying to store a wgpu::RenderPipeline as a void pointer

pipeline.rendering_data = wgpuDeviceCreateRenderPipeline(device, &pipeline_desc);  // rendering_data is a void*

However, when I do this I get this error:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `0`,
  right: `1`: RenderPipeline[1] is no longer alive', /root/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/011a4e2/wgpu-core/src/hub.rs:348:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
zsh: IOT instruction (core dumped)  ./jovial_test

Why is this pytorch error saying that I have more memory allocated than my total memory?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.86 GiB (GPU 0; 12.00 GiB total capacity; 12.84 GiB already allocated; 0 bytes free; 12.86 GiB reserved in total by PyTorch)

I don't understand how this is possible. How can there be more memory allocated and reserved than I have in total?

Since my final goal is to prevent this error in the first place, I'll throw this out there: torch.cuda.empty_cache() and cuda.close() didn't free up space either; they result in the same error (not even a change in the memory already allocated/reserved).
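
For reference, this is how I have been inspecting the numbers from the error message (assuming a reasonably recent PyTorch, which has these introspection helpers):

import torch

free, total = torch.cuda.mem_get_info()              # what the driver reports
print(f"driver: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

print(f"allocated by tensors: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved by caching : {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))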

GPU is not recognised by TensorFlow

I have an RTX 3050 Ti (mobile version); however, it is not recognised by TensorFlow. I have installed the CUDA toolkit and cuDNN and set up the paths, but nothing works. Note: I am not using a virtual environment for this. Please advise.

import sys
import pandas as pd

import sklearn as sk
import tensorflow as tf

print(f"Tensor Flow Version: {tf.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
gpu = len(tf.config.list_physical_devices('GPU'))>0
print("GPU is", "available" if gpu else "NOT AVAILABLE")

Tensor Flow Version: 2.13.0

Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Pandas 2.0.0
Scikit-Learn 1.2.2
GPU is NOT AVAILABLE
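
For completeness, here is a small extra check I can run to see whether this TensorFlow build was compiled with CUDA support at all (standard TF APIs, nothing exotic):

import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Physical devices:", tf.config.list_physical_devices())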

How to optimize ONNX inference for dynamic input

I have example code for creating a session for the ONNX model.

    import onnxruntime as ort

    so = ort.SessionOptions()
    so.inter_op_num_threads = 10
    so.intra_op_num_threads = 10
    session = ort.InferenceSession('example.onnx',
                                   sess_options=so,
                                   providers=['CUDAExecutionProvider'])


So, when I use input of the same size, like 200, it's okay and works very fast.

    for i in tqdm(range(1000)):
        array = np.zeros((1, 200, 80), dtype=np.float32)
        embeddings = session.run(output_names=['embs'], input_feed={'feats': array})

But when I try to use dynamic input, it runs very slowly for the first few hundred or even thousand iterations, and then somehow optimizes itself and runs as fast as the first example.

    for i in tqdm(range(1000)):
        array = np.zeros((1, random.randint(200, 1000), 80), dtype=np.float32)
        embeddings = session.run(output_names=['embs'], input_feed={'feats': array})

Is there any way to speed up the second example?

I tried batching, but because the input sizes can differ a lot, it makes the output a little bit less accurate.
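
For illustration, the shape-bucketing workaround I have been considering looks roughly like this (pad_to_bucket is a hypothetical helper; it assumes the model tolerates zero-padding along the time axis):

    import numpy as np

    BUCKETS = (200, 400, 600, 800, 1000)   # a few fixed lengths instead of arbitrary ones

    def pad_to_bucket(feats: np.ndarray) -> np.ndarray:
        """Zero-pad (1, T, 80) features up to the next bucketed length."""
        t = feats.shape[1]
        target = next(b for b in BUCKETS if b >= t)
        padded = np.zeros((1, target, 80), dtype=np.float32)
        padded[:, :t, :] = feats
        return padded

    # the session then only ever sees len(BUCKETS) distinct shapes, so any per-shape
    # work is paid once per bucket instead of once per new length
    # embeddings = session.run(output_names=['embs'], input_feed={'feats': pad_to_bucket(array)})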

How to time profile/see when tasks are complete in wgpu (targeting web)?

Basically the title. There is support for timestamp queries, but that requires specific device feature support (my laptop, for example, does not support it when targeting web). There is also the function Queue::on_submitted_work_done(callback), which throws an unreachable-code error when I use any callback on wasm. Finally, device.poll(Maintain::Wait) is explicitly a no-op on web according to the docs (for me, it never returns true when I try it). Is there any way to see what's in the queue or check whether it is empty? In the WebGPU API spec, onSubmittedWorkDone is an async function that returns a Promise; if that were the case here, we could simply await completion. I have included an example of my attempt at using Queue::on_submitted_work_done(callback) below, in case it is user error (likely).

let mut encoder = driver.device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
self.calculate_summary(&mut encoder);
self.color_map(&mut encoder); 
driver.queue.on_submitted_work_done(Self::done);
driver.queue.submit(Some(encoder.finish()));

fn done() {
    // I've tried just print statements, incrementing dummy variables, and even
    // empty functions, but it doesn't work.
}
โŒ
โŒ