User Space Scheduling vs Kernel Space Scheduling

I'm having a hard time understanding the difference between user space scheduling and kernel space scheduling. What are the advantages and disadvantages of each?


Windows CPU Scheduler - very high kernel time

We are trying to understand how the Windows CPU Scheduler works in order to optimize our applications to achieve the maximum possible infrastructure/real-work ratio. There are some things in xperf that we don't understand, and we would like to ask the community to shed some light on what's really happening.
We initially started to investigate these issues when we got reports that some servers were "slow" or "unresponsive".
Background information
We have a Windows 2012 R2 Server that runs our middleware infrastructure with the following specs.
We found it concerning that 30% of CPU time was getting wasted in the kernel, so we started to dig deeper.
The server above runs ~500 "host" processes (as Windows services); each of these "host" processes has an inner while loop with a ~250 ms delay (yuck!), and each of them may have ~1-2 "child" processes that execute the actual work.
Although the loop iterates every 250 ms, actual useful work for the "host" application may arrive only every 10-15 seconds, so a lot of cycles are wasted on unnecessary looping.
We are aware that the design of the "host" application is sub-optimal, to say the least, as applied to our scenario. The application is being changed to an event-based model that will not require the loop, and therefore we expect a significant reduction of "kernel" time in the CPU utilization graph.
However, while we were investigating this problem, we did some xperf analysis that raised several general questions about the Windows CPU Scheduler for which we were unable to find any clear/concise explanation.
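As an illustration of that planned change, here is a minimal sketch of the pattern (in Python, purely for illustration; the real "host" applications are Windows/.NET services): the worker blocks on a queue instead of polling every 250 ms, so it consumes no CPU while idle.

```python
import queue
import threading

# Event-based model: the worker parks in a blocking get() until work
# (or a shutdown sentinel) arrives - no 250 ms polling wake-ups.
tasks = queue.Queue()
done = []

def host_worker():
    while True:
        item = tasks.get()        # blocks until something is enqueued
        if item is None:          # sentinel value requests shutdown
            break
        done.append(f"processed {item}")

t = threading.Thread(target=host_worker)
t.start()
tasks.put("job-1")                # in practice, work arrives every 10-15 s
tasks.put("job-2")
tasks.put(None)                   # ask the worker to shut down
t.join()
print(done)  # ['processed job-1', 'processed job-2']
```

The blocking `get()` keeps the thread asleep in the kernel until work actually exists, which is exactly the wake-up cost the polling loop pays roughly four times per second per process.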
What we don't understand
Below is a screenshot from one of the xperf sessions.
You can see from the "CPU Usage (Precise)" that
There are 15 ms time slices, the majority of which are under-utilized. The utilization of those slices is ~35-40%, so I assume this means the CPU is busy only ~35-40% of the time, yet the system's performance (as observed through casual tinkering around the system) is really sluggish.
On top of that, we have the "mysterious" 30% kernel time cost, judging by the Task Manager CPU utilization graph.
Some CPUs are obviously utilized for the whole 15 ms slice and beyond.
As far as Windows CPU Scheduling on multiprocessor systems is concerned:
What causes the 30% kernel cost? Context switching? Something else? What considerations should be made when writing applications to reduce this cost? Or even: how do we achieve perfect utilization with minimal infrastructure cost (on multiprocessor systems where the number of processes is higher than the number of cores)?
What are these 15 ms slices?
Why does CPU utilization have gaps in these slices?
To diagnose the CPU usage issues, you should use Event Tracing for Windows (ETW) to capture CPU sampling data (not the precise data; that one is useful for detecting hangs).
To capture the data, install the Windows Performance Toolkit, which is part of the Windows SDK.
Now run WPRUI.exe, select First Level, under Resource select CPU usage, and click on Start.
Now capture 1 minute of the CPU usage. After 1 minute, click on Save.
Now analyze the generated ETL file with the Windows Performance Analyzer by dragging & dropping the CPU Usage (Sampled) graph onto the analysis pane and ordering the columns as you see in the picture:
Inside WPA, load the debug symbols and expand Stack of the SYSTEM process. In this demo, the CPU usage comes from the nVIDIA driver.

How is parallelism on a single thread/core possible?

Modern programming languages provide parallelism and concurrency mechanisms as first class citizens to their users. I understand how parallel algorithms are programmed and can well imagine how two threads on a multi-core CPU can run in parallel.
Yet, most of these platforms also support running parallel processes on a single thread.
Do these processes really run in parallel?
How, on an assembly level can two different routines be executed simultaneously on a single thread?
TL;DR: parallelism (in the sense of true simultaneous execution) on a single, non-hyperthreaded CPU core is NOT possible.
Hardware parallelism can be achieved at several levels. Ordered by decreasing granularity:
1. multi-processors (several physical CPU packages)
2. multi-cores (several cores within one package)
3. multi-threads ("Hyper-Threading", i.e. "HT")
4. a single hardware thread, time-shared by the OS scheduler
(I voluntarily omit the case of vectorized computations, where several ALUs can be driven by the same core.)
Your question relates to running two software threads in case 3 (when HT is unavailable / disabled) or case 4.
In both cases, the processes actually do NOT run in parallel. The user has an impression of simultaneity due to the extremely fast context switches performed at the CPU level, which sequentially allocate the physical core's (resp. thread's) time to one or the other software thread.
In both cases, those routines are simply not executed simultaneously, but sequentially.
The relative priority allocated to each of those two routines can be set on various OSes through the "priority" you give to the process; it will be handled by the OS's scheduler, which in turn allocates CPU time.
To perform tests to better understand this topic, you may want to google "cpu affinity". This will let you run a two-threaded process on a single physical core of a multi-core CPU, measure the time taken by each of the threads while modifying their priority, and so on.
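To see the sequential interleaving directly, here is a small single-threaded sketch (using Python's `asyncio` purely as an illustration): two routines appear to make progress "at the same time", while the event loop actually runs them strictly one step at a time.

```python
import asyncio

# Two coroutines share ONE OS thread. Each await is a cooperative
# "context switch": the routine yields and the loop runs the other one.
order = []

async def routine(name, steps):
    for i in range(steps):
        order.append(f"{name}{i}")
        await asyncio.sleep(0)   # yield control back to the event loop

async def main():
    # Schedule both routines; they interleave, never overlap.
    await asyncio.gather(routine("A", 3), routine("B", 3))

asyncio.run(main())
print(order)  # ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
```

If one routine did blocking work instead of yielding, the other would make no progress at all, because nothing ever executes simultaneously on the single thread.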
Yes, there is parallelism in each thread and you get it for free, no matter which programming language you use (although the amount of parallelism may vary).
It's called instruction-level parallelism. The details are quite complex and differ between different processor micro-architectures.
Computer Architecture: A Quantitative Approach is a brilliant book which includes a chapter on instruction-level parallelism and the book's examples teach how to think rationally about engineering.

improve application performance and intelligence

I have a question related to application performance and intelligence.
I have created a Windows service that I run on 3 machines with different configurations, and I want it to utilize an appropriate share of each machine's resources (CPU and memory).
Say Machine 1 (M1) has a single core with 1 GB RAM.
Machine 2 (M2) has two cores with 2 GB RAM.
Machine 3 (M3) has 4 cores with 4 GB RAM.
Now when my service runs, it should utilize resources appropriately. For example, if the machine's CPU usage is at 1%, the service should be able to use up to 50% or more; if it's already at 50%, the service should use only 30%. The same goes for RAM. But it should never push usage past a limit like 90%.
Basically, I wrote a multithreaded service that currently doesn't care about machine resources and keeps consuming them. I want to add this intelligence to it.
Please help me out with your ideas.
As Archeg said, based on the number of processors, you can increase the thread count. But increasing the number of threads based on CPU activity is the wrong way to go about it.
Look at it this way - the CPU scheduler allocates time-slots at a millisecond granularity. If the load on the system from other processes is low, it will give your process more time. Period. If there are lots of processes, you will get time-slots less often. You shouldn't thrash it with more threads than necessary.
What you need to do is decide what you want. Is the service time-sensitive? If so, then in a heavily-loaded system you have less CPU time to operate with, while in an idle system you can use more CPU time within the same, say, second. Beware: if your service does I/O, the I/O itself may throttle how much CPU it can use.
With RAM, you could do something like, given how much free RAM the system has, switch algorithms to one that uses less processing or processes faster, but needs more memory (and vice-versa).
The point is that there's no 'service-independent' way to do this kind of intelligent scaling, besides better schedulers (which is something a lot of smart people have been looking at for many many years). You can however write services that are aware of current system constraints and change behavior accordingly.
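A minimal sketch of the first part of that advice (in Python for illustration; the actual service is .NET): derive the worker count once from the processor count rather than from momentary CPU activity, and leave the time-slicing to the OS scheduler.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def make_pool(cpu_bound=True):
    # Size the pool from the hardware, not from current CPU load.
    cores = os.cpu_count() or 1
    # CPU-bound work gains nothing from more threads than cores;
    # I/O-bound work can oversubscribe because threads mostly wait.
    workers = cores if cpu_bound else cores * 4
    return ThreadPoolExecutor(max_workers=workers)

# The same code adapts to M1 (1 core), M2 (2 cores), M3 (4 cores).
with make_pool() as pool:
    squares = list(pool.map(lambda n: n * n, range(8)))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `cores * 4` oversubscription factor for I/O-bound work is an assumption for this sketch, not a universal rule; the right multiplier depends on how long your threads block.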
Building high-performance .NET applications is significantly easier if you design with performance in mind. Make sure you develop a performance plan from the outset of your project. Never try to add performance as a post-build step. Also, use an iterative development process that incorporates constant measuring between iterations.
By following best practice design guidelines, you significantly increase your chances of creating a high-performance application.
Consider the following design guidelines:
Consider security and performance.
Partition your application logically.
Evaluate affinity.
Reduce round trips.
Avoid blocking on long-running tasks.
Use caching.
Avoid unnecessary exceptions.

Is using processor usage for the automated scheduling of processes a good practice?

I've been having a debate of sorts with a co-worker who suggested that we allow some CPU-intensive processes in our enterprise to poll CPU usage and execute their tasks when the CPU usage is low. My counter-point was that while CPU usage in an ideal system would denote the level of system activity on a given server, in an actual system it has too much inconsistency (peaks and dips over a short time) to be an effective indicator of when a CPU-intensive process should run. In addition, I stated that the OS is already designed to manage processor contention between threads and applications. My suggestion was simply to run the process after hours to avoid degrading the user's experience during the day.
My question is: can CPU usage be an effective indicator of when processes should run in an enterprise setting? It would be nice to know if I'm right, sort of right, or just incorrect...
Edit: These applications are .NET services as well as SQL Server scheduled jobs.
No, it isn't (so yes, you are correct).
There are many ways that it can cause problems, off the top of my head:
The OS is attempting to balance resource allocation. In order to do this, it has a scheduling algorithm that uses a view of the current resource usage. What you describe is running a second scheduling algorithm that will fight with the first one (the OS scheduler) over allocation of CPU. This can cause strange feedback.
Just using processor usage doesn't take into account other resources such as memory. When one process runs, it doesn't just displace other processes by using processor cycles. Its working set of data is fighting with other processes to be kept in memory and out of swap. You can seriously degrade the performance (and especially the latency) of other processes if you activate your task because the CPU is not in use and it causes their data to be paged out.
Why reinvent the wheel? This is precisely what priority levels / idle-processing were invented for. If you just want your process to take up background CPU, then set it running at the lowest priority level and allow the OS to schedule it.
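For instance, a minimal sketch of that last point (in Python; `os.nice` is Unix-only, and the Windows equivalent would be `SetPriorityClass` with `IDLE_PRIORITY_CLASS`):

```python
import os

def run_in_background():
    """Drop this process to the lowest scheduling priority, then let
    the OS hand it only the CPU cycles nothing else wants."""
    if not hasattr(os, "nice"):
        return None          # non-Unix platform: use SetPriorityClass instead
    try:
        return os.nice(19)   # +19 requests the lowest priority; returns new niceness
    except OSError:
        return None          # raising niceness was not permitted here

new_niceness = run_in_background()
# ... now run the CPU-intensive work; the scheduler does the throttling.
```

No polling of CPU usage is needed: the scheduler itself only hands this process cycles that no higher-priority work wants.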

Parallel programming on a Quad-Core and a VM?

I'm thinking of slowly picking up Parallel Programming. I've seen people use clusters with OpenMPI installed to learn this stuff. I do not have access to a cluster but have a Quad-Core machine. Will I be able to experience any benefit here? Also, if I'm running linux inside a Virtual machine, does it make sense in using OpenMPI inside a VM?
If your target is to learn, you don't need a cluster at all. Your quad-core (or any dual-core or even a single-cored) computer will be more than enough. The main point is to learn how to think "in parallel" and how to design your application.
Some important points are to:
Exploit different parallelism paradigms like divide-and-conquer, master-worker, SPMD, ... depending on data and tasks dependencies of what you want to do.
Choose different data division granularities to check the computation/communication ratio (in the case of message passing), or to check the amount of serial execution caused by mutual exclusion on memory regions.
Having a quad-core, you can measure your approach's speedup (the performance gain attained through parallelization), which is normally given by dividing the time of the non-parallelized execution by the time of the parallel execution.
The closer you get to 4 (four cores meaning 1/4th of the execution time), the better your parallelization strategy was (provided you could evenly distribute work and data).
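The speedup and efficiency metrics described above can be captured in a couple of helper functions (an illustrative sketch):

```python
def speedup(t_serial, t_parallel):
    # Speedup = time of the non-parallelized run / time of the parallel run.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    # Efficiency = speedup / core count; 1.0 means perfect linear scaling.
    return speedup(t_serial, t_parallel) / cores

# E.g. a job taking 20 s serially and 6 s on a quad-core:
s = speedup(20.0, 6.0)        # ~3.33x
e = efficiency(20.0, 6.0, 4)  # ~0.83, i.e. 83% of the ideal 4x
```

An efficiency well below 1.0 usually points at uneven work distribution, communication overhead, or a serial fraction that Amdahl's law will not let you escape.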