Finding Perfomance Issues
This document targets people with a technological background, but not necessary developers.
Many times it is easy to notice the application has performance issues because it behaves slower than usual or a process takes to finish longer than it should. But usually, it is not so easy to identify where the problem is. This article aims to be a guide to help to find the pieces causing these issues.
This article guides on finding the problem, but the fix for it is out of scope of the document. Even though all information gathered to reach the point of knowing where the issue, can be really helpful to find a proper solution.
All the issues this article discusses about are in server side, so in case the performance problems are in net or in client side it will not be helpful.
Finding the issue
When a performance issue is detected, these are the steps to take in order to find what is causing it.
Usually, there are more than one performance problem. In this case this flow should be executed several times to find and fix each of them, starting with the biggest ones.
Openbravo is a web application that is served from a web container (Tomcat), most of the business logic is implemented in java, it stores data in a database (PostgreSQL or Oracle). Backend performance problems can be in any of these two sides: Java or Database (DB).
The first thing to do is to identify whether the problem is in DB or Java. A first approach is to understand where the system resources are consumed in.
top is a Unix utility that provides information about which are the processes in execution and how much CPU and memory they use.
This is the output it generates:
The important columns are:
- COMMAND: is the name of the process. Typically, tomcat appears as
java, and database as
postmaster(in case of PostgreSQL) or
- %CPU: shows the percentage of CPU the process is consuming in this moment. Note in case of multiple cores, it is possible this number to be higher than 100%, in case there are multiple parallel threads. This is the column to look at to understand which is the process that is using resources. In the screenshot above, tomcat is consuming 86% whereas DB is just 9%, this would point to continue looking the java process.
- PID: Process Identifier, it is the ID that system uses to identify the process.
- %MEM: It is the amount of memory consumed by the process. Take into account that this number is not significant for the java process. Java allocates memory up to a maximum, and it is shown here even it is free.
- USER: It is the user that launched the process.
The next step, once it is known the problem is in Java side, is to determine which is the process that is causing performance downgrade.
In some occasions the culprit is directly known, in case the rest of the application is acceptably working but:
- there is a process or report that is launched as per user request from UI and it takes more time than expected or
- in case of background processes, Duration field in Process Monitor window is unexepectly high.
In any of this two cases, it is possible to directly profile the bad-perfomant process.
When there is a high CPU consumption, or the application overall performance degrades over time, but it is not know what is causing it, it is possible to do some monitoring on the application.
There are different approaches to do so: using some tool that facilitates the task or using through command line the tools JDK provides for it.
VisualVM is a visual tool that integrates several commandline JDK tools and lightweight profiling capabilities. There are many other similar tools for this purpose, we are focusing in this one because it is included within JDK. Note the purpose of this document is to give an overview of the tools, for a comprehensive guide, please visit the site of each of them.
This tool facilitates monitoring and profiling and it can be used in both, development and production environments because the overhead it adds to the application is small.
When running the application in a development environment, or in a server capable to display graphics, VisualVM can be run locally. To attach the application to profile you only need to select it in the Applications tab.
To attach to an application running in a remote server, there are different approaches. One of them is:
- Download MyAgent.jar and copy it in the server.
- Add the following params the the
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent:/path_to/MyAgent.jar(changing path_to with the actual path)
- On ubuntu Stack or Rpath appliance add also the following to CATALINA_OPTS variable:
- Restart Tomcat.
- In the remote computer where VisualVM is installed, do a tunnel to remote server
ssh -L 3000:localhost:3000 -L 3001:localhost:3001 IPwhere IP is remote server's IP.
- Launch VisualVM, right click on left panel, "Add JMX Connection", Connection: localhost:3000
What to look at
Usually, the problems can come from two different ways: memory or CPU consumption. Depending on which of this two cases the problem resides in, the action to be taken is different.
If using VisualVM (or any other graphical tool), you can see in the monitor tab graphs for CPU and Memory consumption.
- CPU: It displays 2 lines. One for CPU usage and another one for Garbage Collector (GC) activity. In case CPU line is high, it usually means there is one process (or several) in execution. The following step in this case is to take a Thread Dump and/or Profile.
- Heap: The graph shows 2 lines. One for Heap size and another one for Used heap. The first one just indicates which is the currently allocated limit (this is what jvm is consuming from the OS point of view, and is what it is shown by top command), though the actual maximum size it can consume appears on top of it as Max. The important part here the Used heap, but take also into account used doesn't mean it cannot be released if needed, just it has not been already collected; in other words, next GC might reduce the used heap if it is able to. The problem appears when the heap usage is high, GC is launched (you can see GC activity in the CPU graph), but it is not able to release memory because it is still referenced. If more memory is tried to be allocated but none can be freed it can occur two different problems:
- The size to be allocated is bigger than the actual free space, so a
java.lang.OutOfMemoryErrorexception is thrown. This can be seen in the opebravo.log.
- If the size is smaller, it can be allocated, but as limit is approaching, GC collector will be executed more and more often, reaching a point where GC consumes all CPU. Reached this point, the application becomes almost unusable because all available resources are in use to unsuccessfully try to reduce the amount of memory. Sometimes, if the process doesn't finish soon, it is required to restart Tomcat.
- The size to be allocated is bigger than the actual free space, so a
When memory issues are detected, the recommended step is to take a Memory dump.
In case of not having a visual tool to do this analysis, it can be done in command line. The problem is it is not possible to determine how much of CPU consumed by java process is spent in GC and how much in actual process execution.
It is possible to know which the stack of current processes executing
jstack PID, where PID is the one shown in top. Note this command must be executed with the same user that launched the java process. In case this user cannot log in, root user can execute
su -s /bin/sh user -c 'jstack PID'. Note this gives just a snapshot of current processes, but not an idea of how much is consumed by each of them.
To determine which is the current heap usage the following command can be used
jmap -heap PID. This command requires to be executed as root.
A thread dump provides information about stacks of all process in execution at certain moment, but it does not highlight how CPU usage is shared among them nor how long each of them has been in execution. It can help in case profiling is not possible in production server. A way of trying to identify the slow parts is to do several dumps is short periods of time (one each second for example), to compare them looking for methods appearing in all of them, these methods are probably taking long to finish, so they are candidates for being causing performance issues.
Thread dumps are generated quickly so taking them in a production instance does not damage performance.
They can be obtained with VisualVM, in Threads tab by clicking on Thread Dump button.
To get them from command line use
jstack -F -m -l PID > filename where filename is the name of the file that will contain the dump.
VisualVM, as many other tools, allows to do sampling on CPU usage (from Sampler tab), when sampling is active, it periodically takes thread dumps and based on this info, it is able to proximatly determine time spent on each method. After some time of sampling, it is possible to take an snapshot that facilitates navigation through threads and methods. In this way can be known where the CPU resources are spent during some lapse of time.
When the problems are caused by an excessive allocation of memory, the best way to analyze it, is creating a memory dump. This dump contains a snapshot of heap memory in a concrete moment. Note taking a dump is expensive in terms of time (it can last up to 2 minutes) and when it is in progress, jvm is frozen, so application cannot be used. Therefore, it must be used cautiously in production environments, but it is highly recommendable to take one in case of the application becoming unusable because of heap problems.
Getting the dump
It can be obtained from VisualVM from Monitor tab, Heap Dump button. Note by default VisualVM opens the dump just after creating it, this is not necessary, once created you can save it (left panel, right click on the new snapshot and Save as) and close VisualVM, as opening it also requires much memory as well as some time.
It can also be taken in command line with the following instruction:
jmap -dump:format=b,file=filename.hprof PID (note there is not a blankspace after format=b,).
The size of the file is similar to the actual size used in that moment by heap. In case of wanting to transfer this file to be analyzed in another computer, it is recommendable to compress it.
Similarly to other commands explained above only the user that owns the java process or root can get the dump. In some cases, when dumping with root,
-F parameter is required, note
-F takes much more time to be executed and does not include thread stacks.
Depending on Openbravo installation, user that executes tomcat can vary:
- Openbravo Ubuntu package:
$ sudo su - openbravo $ jmap -dump:format=b,file=filename.hprof PID
$ su - # su - tomcat -s /bin/bash -c "jmap -dump:format=b,file=filename.hprof PID"
A very good tool to analyze a memory dump is Eclipse Memory Analyzer (MAT).
When opening an existent dump (File > Open Heap Dump), mat processes it to collect its information, this process takes some minutes to be completed (depending on the size of the dump) and it requires at least the same amount of heap the dump uses. Maximum size for mat is defined in MemoryAnalizer.ini file. Next times the dump is opened, this process is not required so it is fast and doesn't require so much memory.
The following sections briefly explain some of the most useful features mat has.
This first windows shows a diagram that in a single shot visualizes the biggest objects in memory. In case there are a few objects consuming most of the memory, they are clear candidates to be doing something wrong.
Dominator tree allows to easily identify the biggest chunks of retained memory and the keep-alive dependencies among objects. In this tree view, parent objects dominate their children, meaning that if parent was released, all its children could be collected by garbage collector.
It is important to differentiate between shallow and retained heap. Shallow heap is the memory directly consumed by one object, retained heap is the sum of retained heap for the object and all its children in the dominator tree, this is, all the memory that would be removed by GC when the object is garbage collected.
Memory dumps can also contain information about threads! This is very useful to understand which process is generating the big object shown in the dominator tree.
Thread stack view can be opened from Open Query Browser > Java Basics > Thread Stacks. It displays a stack trace of all threads that were running when the dump was taken, with a sum of the retained heap of each of them. Going down in the stack, for each method it is also shown the values of the variables.
Let's go through a very common example (see image in the right). We can see:
- there is a thread retaining 1.5GB
- it is performing a DAL flush. Flushes can be programatically invoked, but they are also automatically called at the end of each request, the time and resources they consume depend on how big is the memory used in that moment by DAL,
- this case is latter one: it is invoked on request finalization, as can be seen because
DalRequestFilter.doFiltermethod is in the thread.
- Reached this point, we want to know which servlet and what paramters it received.
DalRequestFilter.doFiltermethod has a local variable of
RequestFacadeclass containing the
- this object has a
requestDispatcherPathobject within a
strValuevariable that contains the path (URL) of the servlet, its value can be seen by selecting it in left panel, Value tab. In this case it is
/org.openbravo.service.datasource/Invoice, this means the thread retaining 1.5GB is a call to Invoice datasource, but which are the parameters it received to become so big?
requestobject has also the answer to that question,
postDatavariable in it contains all the parameters passed in the post request to the Servlet. Selecting it, left panel will show the contents.
As mentioned above, this case is quite typical: a process working with DAL is doing a bad memory usage. Data Access Layer Performance article explains how DAL processes can manage it better.
When problems are in database side, it is needed to understand which queries are performing incorrectly. Actions to be taken to determine it depend on the database in use.
Logging slow queries
PostgreSQL allows to display in log all queries, including values for its parameters, taking longer than a minimum time. This is very useful to understand which are the slower queries, how much time they take to be executed and how often they are used.
This is defined with the
log_min_duration_statement parameter in
postgresql.conf file (location of this file varies on the system installation). The value assigned to this parameter is an integer number indicating the minimum time in milliseconds to log queries taking more than this time. The amount of time to log depends on the system, but
1000 to log queries longer than 1 second is a good rule of thumb to start with. Also note, it logs "wall time", not CPU time, so in case of a lock affecting a query, this query will appear as slow even in other circumstances it wouldn't have any problem and the query to fix might be the one causing the lock.
After modifying this value, it is not required to restart postgresql. Reloading it is enough, so it can be modified without affecting users in the application. Depending on system, reload can be done executing
Log file location also depends on configuration. It typically, can be found in
Knowing what is going on now
pg_stat_activity view contains a row per process connected to PostgreSQL, displaying queries that are currently in execution, the time they were started, and if they are waiting for some resource to be unlocked.
Querying this view is useful in case of high CPU consumption by postgres process to know which are the querying taking this CPU time.
How a query is executed
Once an slow query has been detected it is possible to know how it is executed, for this purpose
explain command is used.
Note this section is not intended to give detailed information about
explain usage but just some hints, for more information visit PostgreSQL documentation.
Explain can help to understand if a query should be refactored to be improved or if an index would be useful.
Let's see a simple example:
EXPLAIN UPDATE c_bpartner SET name ='aa' WHERE c_bpartner_id='100'; QUERY PLAN ------------------------------------------------------------------------------------------- INDEX Scan USING c_bpartner_idsalesrep ON c_bpartner (cost=0.00..8.34 rows=1 width=3224) INDEX Cond: ((c_bpartner_id)::text = '100'::text) (2 rows)
In this case, we are updating a row in
c_bpartner table filtering by its primary key (PK), as PKs define an index by default this index is used to find the row to update, and finally the cost of the query is small (8.34).
In the following example:
EXPLAIN UPDATE c_bpartner SET name ='--'||name WHERE isactive='N'; QUERY PLAN ---------------------------------------------------------------- Seq Scan ON c_bpartner (cost=0.00..8757.82 rows=1 width=3237) Filter: (isactive = 'N'::bpchar) (2 rows)
All rows inactive rows are updated, cost is much bigger (8757.82) because a sequential scan is done to know which rows should be updated. Sequential scan implies to loop over all records in the table whereas index scan is able to directly go to the required records.
Now let's try to improve this figures:
CREATE INDEX em_mymod_bp_active_idx ON c_bpartner (isactive ASC NULLS LAST); CREATE INDEX EXPLAIN UPDATE c_bpartner SET name ='--'||name WHERE isactive='N'; QUERY PLAN -------------------------------------------------------------------------------------------- INDEX Scan USING em_mymod_bp_active_idx ON c_bpartner (cost=0.00..8.29 rows=1 width=3237) INDEX Cond: (isactive = 'N'::bpchar) (2 rows)
Creating an index for
isactive column makes a difference, now it is not sequentially scanning the table but it uses the index reducing cost from 8757.82 to 8.29.
Just a couple of notes about indexes:
- Indexes are not free. Maintaining and index requires space in disk, and even it can dramatically improve times when querying, it can damage update/insert times because is when modifying data when the index is also updated. Therefore, indexes should not be included blindly, but just where they really make sense.
- Having an index allows to use it but does not forces so. Planner decides if it is worth to use or not, and this decision can be different for a same structure depending on the actual data stored in it. This means, a development environment without actual data can see different execution plans than an production environment with tons of data. Plan can also vary over time when stored data grows up.
- Explain plans for updates shows the plan for the update, but it does not take into account the triggers that might be executed on it. So it is possible a good explain plan not to perform correctly because the execution of the triggers associated to the updated table is not good enough.
Another way to look for missing indexes
As seen in previous section, indexes prevents sequential scans, which are very harmful in tables with big amounts of data. It is possible to know how many times a table has been secuentially scanned. Even this does not directly tell where the index should be included, it is giving a hint that one (or more) indexes would reduce these sequential scans.
To see for each table how many times it has been read and how these reads have been done since last time statistics were initialized. This query can be executed:
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch FROM pg_stat_user_tables ORDER BY seq_tup_read DESC;
It shows all tables, ordered descendetly by the number of rows read in sequential scans.
Statistics are reinitialized to 0 with:
Knowing a process to be slow, it is useful to reset statistics and execute it to see which tables and how it queries.
More information about statistics can be found in PosgreSQL documentation.
Tracking functions allows to know number of times each PL function has been invoked and how much time they consumed.
SELECT * FROM pg_stat_user_functions ORDER BY self_time DESC
To track this information
track_functions property needs to be enabled in
Tracking SQL Statements
It is also possible to track SQL statements.
This tracking can optionally track the statements executed within PL functions and triggers, giving in this way information about which parts inside them are slow. For example executing:
INSERT INTO c_order ...
would show several statements: one for the insert and also all the ones executed by triggers associated to c_order table.
To configure this:
- Add to
shared_preload_libraries = '$libdir/pg_stat_statements' pg_stat_statements.max = 10000 pg_stat_statements.track = all
- restart PostgreSQL
- In each DB you want to activate it run:
create extension pg_stat_statements
Once it is active statistics can be seen with:
SELECT * FROM pg_stat_statements ORDER BY total_time DESC
These statistics are reset separately from the rest by executing:
For more information visit PostgreSQL documentation
When a PreparedStatemnt is executed (usually, using directly jdbc without sqlc nor DAL) and it is not closed an open cursor is kept open. This has two problems, one is the maximum number of open cursors can be reached causing errors when trying to open more, the other is the memory used for them is leaked as it never can be collected.
A query to get open cursors:
SELECT sql_text, COUNT(*) FROM v$open_cursor GROUP BY sql_text ORDER BY COUNT(*) DESC;
It shows each query that has opened a cursor that has not been already closed and for each of them how many open cursors has. Note that having open cursors while the application is running is normal. What is important here are the numbers of them and to see if they are kept and never closed.
Detecting open cursors and knowing the query helps to look for the processes executing those queries.