Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Archives

Programming

Friday, July 31, 2015 #

It has been a long wait to finally get the new version of the new Windows Performance Toolkit. I had not much time to test the betas since until now managed call stack resolution was not working.

WPT for Windows 10 makes some nice progress which you can download here. Click on the Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and

it will install on your machine within minutes. If you download the complete Windows 10 SDK you are not downloading 170MB but several GB.

 

Management Summary

  • Graphs are more readable.
  • Graphs with short spikes are much better rendered without the need to zoom into to see them.
  • Colors for graphs are changeable (finally).
  • Filtering can be undone now in the context menu.
  • Quick Search in all tables.
  • Symbol load dialog was improved.
  • You can configure your own stack tag files in Trace Properties.
  • No new Graph Types (at least with the default providers I enable).
  • 5 Added columns to CPU Usage Sampled
    • Two could be extremely useful: Source File Name and Source Line Number
  • 12 Added columns to CPU Usage Precise.
    • Two are to user mode developers useful: ProcessOutOfMemory and CPU Usage (in view).
  • Occasional crashes are also featured now.

CPU Usage Sampled

image

The biggest improvement is certainly the possibility to see source file and line numbers. You can find out this way where most CPU is spent on which line if the CPU consumption happens in your code.

A quick recap for the newbies: The CPU Usage Sampled graph is generated by taking the stack trace of all running threads 1000 times per second (this is the default). These call stacks are then added together.

A count in your Main method of 1000 means that the method has used one second of CPU time.

If the call stack ends not in your e.g. Main method when the stack trace is taken then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method

may be causing a high CPU consumption but it might never show up with a source and line number because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded.

It would be nice to have file and line numbers for managed code as well but this feature is of limited use as it is now.

Ideally I want to see this stuff while I am drilling into a call stack all my way down and not only if the call stack ends in a method which I did compile.

 

For Reference here is the complete list of columns you can choose from. Green are unchanged columns, Bold are new ones. I never have found out how to give a thread a name in WPT.

If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.

WPT 8.1 WPT 10
% Weight % Weight
Address Address
All Count All Count
Annotation Annotation
Compiler Optimization Compiler Optimization
Count Count
CPU CPU
Display Name Display Name
DPC/ISR DPC/ISR
Function Function
  Image RVA
Inlined Functions Inlined Functions
Is PGO'ed Is PGO'ed
Module Module
PGO Counts PGO Counts
PGO Dynamic Instruction Count PGO Dynamic Instruction Count
Priority Priority
Process Process
Process Name Process Name
Rank Rank
  Section Name
  Section RVA
  Source File Name
  Source Line Number
Stack Stack
Table Table
Thread Activity Tag Thread Activity Tag
Thread ID Thread ID
Thread Name Thread Name
Thread Start Function Thread Start Function
Thread Start Module Thread Start Module
TimeStamp TimeStamp
Trace # Trace #
Weight Weight
Weight (in view) Weight (in view)

 

CPU Usage Precise

This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other.

I am not sure how WPA detects that a process was out of memory but I think the kernel knows quite well if something did go wrong. The ProcessOutOfMemory column could be handy

to check stress tests when the machine was under heavy load which could explain subsequent application failures.

CPU Usage (in view) seems to be an attempt to display a more true thread running time. I am not sure how this value is calculated but it seems that if all cores are in use and you

get significant Ready times then it differs quite a lot. Otherwise the differences are marginal.

 

There have been quite some renaming of columns which might render your custom filter and enablement conditions useless and WPA will show you an error.

image

If your old preset did enable by default all processes except the Idle process which clutters up the view it was set to:

([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))

With WPT 10 you need to change it to:

([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))

This happens to all users of WPT 8.1 which have saved a custom default profile which contains this enablement rule.

With an empty enablement rule your view will look like

image

and with the rule

image

you will see no difference when you apply it. Why? Because this rule only enables matching stuff but it does not disable the non matching

processes! I am with you that this is very counter intuitive but it is the way it is. If you save this as new default profile and then open a new file

you will no longer get a view where the Idle process is enabled by default.

For reference I list here all available columns with the old and new names so you have an easier time to update your custom filters, enablement

and expansion rules.

Green lines are unchanged, yellow are renamed columns and bold are new columns in WPT 10.

WPT 8.1 WPT 10
% CPU Usage % CPU Usage
  AdjustIncrement
  AdjustReason
Annotation Annotation
Count Count
Count:Waits Count:Waits
Cpu Cpu
CPU Usage (ms) CPU Usage
  CPU Usage (in view)
  DirectSwitchAttempt
IdealCpu Ideal Cpu
  KernelStackNotResident
LastSwitchOutTime (s) Last Switch-Out Time
NewInPri New In Pri
NewOutPri New Out Pri
NewPrevOutPri New Prev Out Pri
NewPrevState New Prev State
NewPrevWaitMode New Prev Wait Mode
NewPrevWaitReason New Prev Wait Reason
NewPriDecr New Pri Decr
NewProcess New Process
NewProcess Name New Process Name
NewQnt New Qnt
NewState New State
NewInSwitchTime (us) New Switch-In Time
NewThreadId New Thread Id
NewThreadStack New Thread Stack
NewThreadStartFunction New Thread Start Function
NewThreadStartModule New Thread Start Module
NewWaitMode New Wait Mode
NewWaitReason New Wait Reason
  NewThreadPrevRemainingQuantum
  NewThreadRank
  NewThreadRemainingQuantum
NextSwitchOutTime (s) Next Switch-Out Time
OldOutPri Old Out Pri
OldProcess Old Process
OldProcess Name Old Process Name
OldQnt Old Qnt
OldState Old State
OldInSwitchTime (us) Old Switch-In Time
OldThreadId Old Thread Id
OldThreadStartFunction Old Thread Start Function
OldThreadStartModule Old Thread Start Module
OldWaitMode Old Wait Mode
OldWaitReason Old Wait Reason
  OldThreadRank
  OldThreadRemainingQuantum
PrevCState Prev CState
  ProcessOutOfMemory
Ready (us) Ready
ReadyThreadStack Ready Thread Stack
ReadyTime (s) Ready Time
ReadyingProcess Readying Process
ReadyingProcess Name Readying Process Name
ReadyingThreadId Readying Thread Id
ReadyingThreadStartFunction Readying Thread Start Function
ReadyingThreadStartModule Readying Thread Start Module
  ReadyThreadInDPC
SwitchInTime (s) Switch-In Time
Table Table
Thread Activity Tag Thread Activity Tag
Thread Name Thread Name
TimeSinceLast (us) Time Since Last
Trace # Trace #
Waits (us) Waits

File IO View

The File IO view has now an additional column named File Path which acts like in Path Tree in Disc IO which is extremely useful if you do differential IO analysis. It seems that my

complaints about the not complete feature parity between Disc and File view have helped a bit.

image

This view has still a bug. If I zoom into it all graphs are disabled and I have to reenable them after every zoom operation. If I zoom out all is ok.

Generic Events

Besides some minor changes the biggest change is that finally it has got a DateTime (local) column. It shows the clock time and not the time since trace start.

This can be very useful for advanced scenarios when someone did report an issue and he did care to write down the time when it did happen.

image

Process View

I did not know it until it was removed. Process Tree is no longer with us. It was not really useful anyway.

Process Tree was added to WPT. That was the reason I never saw it before. After using it for some time I find it is easier to navigate in the tree if

you have a parent process which creates many childs.

 

Disc IO View

The Disc IO view has got only one additional column with the name Boosted which is for my disc IO always false. Not sure if this is a Windows 10 only feature. If someone knows more about it please comment.

 

Windows Performance Recorder

The recorder has not changed much. It has got some additional profiles like ones for .NET, Edge Browser, … The buffer size for .NET Activity is hard coded

to a 160 MB ring buffer which might not be enough for machines were many .NET apps are running. I still do not buy into this easy to use UI since the xml configs

need hard coding the values into it. It can be a hard value or a percentage of the system memory. Both choices are bad if I want to automate performance regression testing with

very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.

Yes I know the complete xml has inheritance built into it so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough

if I want to create a configuration for each problem and set them depending on the use case at hand.

image

  • xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
  • But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
  • Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancement to the ETW infrastructure.
    • E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.

Summary

The new WPA tool has got some nice new features but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition

of the etl.NGenPdb folder when present for the generated pdbs of an etl file. This is a useful convention and the old WPA was so nice to automatically include it to my symbol path.

WPA is very sophisticated. To use it right is an art which needs nearly daily practice. That could explain why this tool although it has a great UI has still got not so

much traction. It is like Windbg in disguise which will quickly cause information overload to the untrained brain.

But with some basic tricks which I will show in the near future you can crack otherwise unsolvable performance nuts with a few mouse clicks.