Transcript
00:03
Speaker 1
All right, are we all seated? We’re all ready to go. Can I start?
00:09
Speaker 2
Cool.
00:10
Speaker 1
And it's a bit awkward because I don't get any fancy introductions like all of the C-level guys that they got next door. But some of you might know me. I am Laura from Element8. So I am a customer success engineer, and I will be covering Ignition diagnostics and troubleshooting today. Cool. So welcome to the session. I see quite a few familiar faces. That is amazing. That means I never have to answer any type of support requests from you guys, like, ever again. I really hope you understand that it is not a joke. Please do not email me. Okay, so in this session, what I'm going to do is I'll be diving into the built-in tools that we use and that you guys can use for gathering diagnostic information about the health of your Ignition system.
00:59
Speaker 2
Right?
01:00
Speaker 1
So this session is also going to cover the tools and explain how we use them during our troubleshooting stages.
01:08
Speaker 2
Cool.
01:08
Speaker 1
So some of you might be here because you're looking for some tips and tricks, and some of you are here because you really do not know what you are doing, right? And that's okay. It's obviously a joke. But I'm here to give you some guidance on this topic today and also actually give you those tips and tricks. It's not some fake stuff that you can find on a blog somewhere, where someone wrote a nice, I don't know, whole paragraph on it and you still have no idea what to do with it or where to find it. These are real-life tools and the actual way that we use everything.
01:46
Speaker 2
Right?
01:46
Speaker 1
So I'm going to kick off with our troubleshooting structure. So we have an in-house process that allows us to actually have a set structure, and we follow it to make sure that we get all of the information that we need to enable us to assist you guys to the best of our capability. Now, the stages of troubleshooting are pretty much a framework for problem solving, and they are defined as discovery, identification, isolation and resolution. Now, our goal as support engineers is to get to the bottom of the problem and obviously to find out the reason why the issue occurred.
02:25
Speaker 2
Right.
02:25
Speaker 1
And of course, to provide you with a solution. So, covering just the diagnostic resources in Ignition, there are quite a few. Just a few. Right. I'm going to discuss them, and I'll also be showing you the tools that we actually use to take a look at these diagnostic resources. So you can find all of them in the Ignition gateway as well as in your program files. Like the logs, the thread dumps, the metrics dashboards, as well as running scripts. So in house we mostly use logs and thread dumps because most of the time we can find the answers we're looking for in there, right? So starting off with logs, we get two different types of logs. Now the first ones are wrapper logs. They can be found in your program files, under your logs folder in your Inductive Automation section, as wrapper logs.
03:22
Speaker 1
So it's very interesting. I don't think a lot of you guys know it, but it's external to the Java virtual machine, the JVM. And this means that it can catch startup and shutdown issues. So the wrapper service allows Ignition to run as a Windows service or as a daemon process in a Unix or Linux environment. And it basically starts and monitors the JVM while it is running. And it logs the JVM states to those wrapper log files. So great, Laura, what do wrapper logs do? Like, why would we even take a look at them? What do they mean? How do they help us? Now, they show you the key details of what the wrapper service is seeing and what the subsystems are reporting via those wrapper files, right? So when you're trying to capture an event. Sorry.
04:11
Speaker 1
Or investigate into the past of what might have caused the issue to occur, we usually start at the wrapper files because that’s the basic thing to start at.
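As a loose sketch of what "starting at the wrapper files" can look like in practice, here is a small Python script that scans them for a few keywords. The install path and the keywords are assumptions for a default Windows install; adjust both for your own environment.

```python
# Hedged sketch: scan the rolling wrapper.log files for startup/shutdown clues.
# The path assumes a default Windows install; the keywords are just examples.
import glob
import io

LOG_GLOB = r"C:\Program Files\Inductive Automation\Ignition\logs\wrapper.log*"
KEYWORDS = ("Starting", "Shutting down", "OutOfMemoryError", "ERROR")

for path in sorted(glob.glob(LOG_GLOB)):
    with io.open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            if any(keyword in line for keyword in KEYWORDS):
                print("%s: %s" % (path, line.rstrip()))
```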
04:22
Speaker 2
Right.
04:23
Speaker 1
The second ones are system logs. So these guys are managed by the JVM, and it also allows you to then filter them according to their MDC keys and their severity level. So MDC keys, right. They are called mapped diagnostic context keys and they allow you to specify a specific context. So let's say for example a specific project, and then you can set a logging level for it, meaning you will set all of the loggers for that specific project to that specific level. Now this is useful to help you when you diagnose an issue that is very specific to one system in the gateway. Now each logger, I think a lot of you guys have seen something like this, has a severity level.
05:12
Speaker 1
It starts at trace, which is the lowest severity level, and it can go all the way up to error, which is obviously the highest. But the higher the severity level, the less information you're going to get from that logger. So that's usually why I ask you to set a logger to trace before you submit those logs, so I'm going to get the most information from those loggers.
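As a rough illustration of those levels, here is a minimal Jython sketch of a gateway script logger using system.util.getLogger. The logger name and messages are just examples; which of these lines actually end up in the gateway logs depends on the level configured for that logger, which is why we ask for trace while debugging.

```python
# Minimal sketch: one script logger emitting at every severity level.
# Only messages at or above the logger's configured level are kept.
logger = system.util.getLogger("MyProject.PumpControl")  # example logger name

logger.trace("Raw tag values read: %s" % [1, 2, 3])   # most detail, usually filtered out
logger.debug("Recalculated setpoint to 42")
logger.info("Pump sequence started")
logger.warn("Setpoint outside the preferred range")
logger.error("Failed to write setpoint to the PLC")    # least detail, always kept
```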
05:34
Speaker 2
Cool.
05:35
Speaker 1
They can be found actually in the exact same place as you can find the wrapper files in your C drive under the logs folder, or you can find them in the gateway under status, the diagnostic log section. So this is viewing the logs in real time, they obviously update in real time, and you can export them from here and also set your logger levels over here.
05:55
Speaker 2
Cool, right?
05:57
Speaker 1
So some recommended tools for actually viewing logs from your side. I've got Notepad++. Thank you. I did not say Notepad+. That happens so much. And BareTail, I don't know who of you guys has ever heard of BareTail.
06:13
Speaker 2
No.
06:14
Speaker 1
Right, I'm going to focus on Notepad++ because it's like the most generic one. And also it's very easy to browse multiple files at the same time, especially when you're using a keyword to kind of track events over multiple files. So depending on the type of message that the logger is reporting, it will usually include a stack trace, right, and those are the ones that you can find in a thread dump as well. And it gives you a bit of clues as to what subsystems are playing a part in that particular event. Now when we review the wrapper files, I found, again, Notepad++ to be invaluable when you're kind of like working through them in terms of searching through multiple log files quickly.
07:03
Speaker 1
And the whole thing is it just gives you like a list of events with a timestamp and a count of how many events have occurred. And that just makes it very easy to navigate through it. Cool. So this is like a nice tool that you can use yourself. I think most of the newest Windows systems have it already downloaded and installed, so you can use that as well. Then these files might look familiar to a few of you guys, right? Now, all of these files that I mentioned, the logs, the metrics, the thread dumps, the system logs, as well as like any type of gateway information, that's very valuable to us when we do any type of supporting, and to you as an SI as well.
07:45
Speaker 2
Right?
07:45
Speaker 1
And we usually ask them of you in the first response. So if you email us, it's like, no, I want a diagnostics bundle.
07:53
Speaker 2
Right?
07:54
Speaker 1
The diagnostics bundle, you can get that if you click on the generate diagnostics bundle button. And you can find this guy in the status page of your gateway, right on the overview page, in the right-hand corner, and it includes pretty much everything you see over here. So that would be gateway information like thread dumps, system logs, wrapper logs, et cetera. So instead of exporting them one by one, click on this button, it downloads a little zip file and you can just pop it in the email.
08:26
Speaker 2
Cool.
08:27
Speaker 1
So the second resource that we use is thread dumps. Now, I was required to ask a question. You guys know we're handing out the little socks, so who can tell me what a thread is? Obviously I was going to ask, like, the hardest question of the whole of today, what is a thread? Okay, I've got another question a bit later on. No googling, no googling. All right, so a thread of execution is the smallest sequence of programmed instructions that can be managed by a scheduler, and the scheduler is usually handled by the operating system itself. So what does that mean? You guys don't actually need to care exactly what that means. But in many cases a thread is just a component of an overall process. So how do we obtain these thread dumps?
09:24
Speaker 1
Now in that little diagnostics bundle, it actually gets exported as a file. Otherwise in the gateway there’s a thread section where you can also just export it yourself. And great. Now what does it actually do? So thread dumps can tell us about what processes are currently running and in what state those threads are. Are the threads stuck in a particular state? What’s happening with them? And are they expensive? Are they costing this server any type of resources that we can’t afford, et cetera?
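If you want to capture one from a script instead of clicking through the gateway, a minimal sketch might look like the following. It assumes system.util.threadDump() is available in your Ignition version and returns the dump as a string, and the output folder is just an example.

```python
# Hedged sketch: capture a thread dump from a gateway script and save it with
# a timestamp. Assumes system.util.threadDump() exists in your Ignition version.
from java.text import SimpleDateFormat
from java.util import Date

stamp = SimpleDateFormat("yyyyMMdd-HHmmss").format(Date())
dump = system.util.threadDump()  # returns the thread dump as a string

# Example output folder -- point this somewhere that actually exists.
system.file.writeFile("C:/IgnitionDumps/threads-%s.json" % stamp, dump)
```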
09:59
Speaker 2
Cool.
10:00
Speaker 1
Laura, that's amazing. What does a thread look like and how do we read it? This is a thread. Okay, so this cutout or this screenshot is what I got from the thread dump file, and I just opened it in a text editor. So I'm going to go over a few of the attributes that a thread has. You don't need to know exactly what they are or what they mean. It's just so that it's not too much at a time when you look at it.
10:28
Speaker 3
Cool.
10:28
Speaker 1
We have the name, which is obviously the name of the thread. This one is very nice. It tells you it's the thread for the alarm notification pipeline called pipeline. It's very descriptive as to what the thread is assigned to, so usually you won't get confused with what they're used for. Secondly, we have the ID, so each thread has a unique ID, you'll see. When you do thread execution tracking, you can create an event mapping of the thread by filtering with its ID. Threads go through certain states, and I'll cover the states when I get to that. When the thread is done and it's completed and it's being released, that thread can be reassigned to a new process. So the ID is unique in that current process to that thread. Then the state.
11:21
Speaker 1
So like I mentioned, that is the state that the thread is currently in. This one is in the waiting state. So threads can have multiple states: new, runnable, blocked, waiting, timed waiting and terminated. Then the CPU usage, that is obviously the percent of the CPU that the thread is currently utilizing. That is also a very good indication if the thread is running high or causing any type of problems. Then we have the infamous stack traces. So when people look at errors, they usually freak out and they have no idea where to start.
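Since the export is just JSON, you can sift it with a few lines of plain Python before reaching for anything fancier. This is only a sketch: the field names ("threads", "name", "id", "state", "cpuUsage") are assumptions, so check them against your own export first.

```python
# Hedged sketch: flag blocked or expensive threads in a thread dump export.
# The JSON field names are assumptions -- verify them against your own file.
import json

with open("threads-20240101-120000.json") as f:  # example file name
    dump = json.load(f)

for thread in dump.get("threads", []):
    state = thread.get("state", "")
    cpu = thread.get("cpuUsage") or 0
    if state == "BLOCKED" or cpu > 10:
        print("%s  id=%s  state=%s  cpu=%.1f%%" % (
            thread.get("name"), thread.get("id"), state, cpu))
```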
12:00
Speaker 2
Right?
12:00
Speaker 1
So I always tell people, if you look at an error, yes, this is a thread, but it's the exact same thing in the sense of a stack trace. So I usually tell people to look at the first sentence of a stack trace, because that is the error message. Whatever is below the error message is just a trace back to where that error originated from, what source it came from. But tracking the execution of threads through thousands of lines, especially when you have a massive thread dump, that sounds horrible, is not a very good way to go through it. So what we do is we use a tool that is developed by Paul Griffiths from Inductive Automation.
12:41
Speaker 1
It's not an Ignition-certified tool, but he developed it because he was a software engineer for years and years, and he realized going through logs and thread dumps and so on like this is just a waste of time. So that tool is what we use to overcome exactly this issue. Enter Kindling. I don't know who of you guys knows what Kindling is. All right, so Kindling is very cool. It's a standalone collection of utilities to help pretty much all of the users of Ignition. You don't need to be any specific type of user to use Kindling, and it features a variety of tools to help work with Ignition's custom data format exports. That's a lot of words, right? So that includes a log viewer, a thread dump viewer, a store-and-forward cache viewer, as well as just a normal IDB viewer and a gateway backup viewer.
13:41
Speaker 1
So that’s pretty cool. The only thing left that it can’t do is literally tell us, this is the problem, this is how you fix it. I mean, if that was as easy as that, then we wouldn’t have a job, right? So I’m going to move over to our troubleshooting stages. So it’s those four stages that I mentioned a bit earlier, and I’m going to start off with discovery. So I think this is a very easy question. First person to answer me will get the socks. Can you solve a problem without having any idea of what’s going on? There we go. No, I see you eyeing the socks like, since you got here, guys, you can’t solve anything if you have no idea what’s going on.
14:25
Speaker 2
Right?
14:26
Speaker 1
So the discovery is the process of understanding the environment that you are working in. So the discovery stage centers around collecting all of the relevant information that we possibly can about the system, the architecture, the gateway servers, and the actual project that you're working in and that you're having the issue in.
14:45
Speaker 2
Right.
14:46
Speaker 1
So you might think that Terbe and I can remember every single client's architecture off the top of our heads, and I wish that was the case, but I'm not Clark Kent. So that is why I usually ask these types of questions in the first email that I would respond to as well, like, what version of Ignition are you running? What OS are you running?
15:10
Speaker 2
Right.
15:10
Speaker 1
Is this a redundant pair of servers? What's the gateway architecture? Is it a production or development server? That's a very important question that hardly anyone really answers. I mean, I need to know, is the client losing money? Is the production being stopped? Like, what's going on? You need to provide us with as much information as you possibly can.
15:34
Speaker 2
Right.
15:34
Speaker 1
So the goal of this stage is to collect as much information about the system as possible.
15:40
Speaker 3
Sure.
15:41
Speaker 1
It’s really hot. That light is, like, in my face.
15:44
Speaker 2
All right, sorry, guys.
15:46
Speaker 1
Now, at this stage, we do not actually use resources like thread dumps and wrapper logs and metrics. No, we actually do use metrics. We don’t use any type of diagnostic resources because it’s going to mislead us and it’s not really going to help us. So we use that in the second stage. In the first stage, we actually need to start at the core of the system.
16:11
Speaker 2
Right.
16:11
Speaker 1
So what does the system look like? Now, you can find all of the answers to all of these questions in the gateway just under the status page. Under overview, you have access to what the architecture is of the system. Like, what version the client is running, or you are running, whether redundancy is set up, the OS, any type of extra information that you might provide that might help. You can find all of it there. Exactly in the same tab, under the status page, there's a performance overview, and that will give you a bit more information about the metrics of the system, like the CPU usage, the thread count, their statuses, et cetera.
16:51
Speaker 1
If you want to dive a little bit deeper into it, under the diagnostics section, you’ll see that there’s a metrics dashboard option and that allows you to actually build out your own metrics dashboard based on what you would like to see. And under connections, there’s a little gateway network option. That’s where you can see your gateway network in list form or as a live diagram as well. So you can instantly see any type of connections being faulted or what the overall architecture looks like. Usually just a nice little screenshot of that helps us as well.
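If you would rather watch those numbers over time instead of eyeballing the dashboard, here is a hedged sketch of a gateway timer script that logs heap usage. It uses the standard JVM management beans, which are reachable from Jython on the gateway; the logger name is just an example and the schedule is up to you.

```python
# Hedged sketch: gateway timer script that logs heap usage so a slowly
# climbing baseline shows up in the logs before it becomes an outage.
from java.lang.management import ManagementFactory

heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()
usedMb = heap.getUsed() / (1024.0 * 1024.0)
maxMb = heap.getMax() / (1024.0 * 1024.0)

logger = system.util.getLogger("Diagnostics.Heap")  # example logger name
logger.info("Heap usage: %.0f MB of %.0f MB (%.0f%%)"
            % (usedMb, maxMb, 100.0 * usedMb / maxMb))
```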
17:22
Speaker 2
Right.
17:22
Speaker 1
So if you wondered what this type of information might look like to us, this is the metrics file opened up in Kindling, right? So this is a system that obviously has a high CPU usage, but it's a massive system, so it kind of makes sense. But I can see, for example, the heap memory being used is quite high. So that already indicates that something is wrong there. So that's why we like to use this tool. Obviously a bit of graphical visualization helps us as well. Okay. The second stage, identification. So simply stated, this defines the question of what is the problem? So in the identification stage, we work through the information that you guys provide us, right? So this should be centered around understanding the issue, and it involves parsing that information.
18:18
Speaker 1
So all of the details need to be identified and obviously clearly documented. And we like it when you guys do it for us in the form of an email. Just by the way, something that I've learned from Inductive Automation, it's pretty cool. They have a rule called rule zero, and I think it's something that we should all live by: trust, but verify. So that means, I mean, guys, customers will report an issue to you, right? That is something that they see. They don't really understand the system, they don't know what's going on, they don't know what's the underlying issue. So if they tell you, listen, I can't see any trends on my trend chart. And you ask them, is your database connected? They'll say yes. Yeah, I checked three days ago. You need to verify that, like, right now.
19:07
Speaker 1
That's your job as an SI, and that's our job as a support engineer as well. I know sometimes our questions are irritating, but it happens for a reason. So the customer, like I said, might actually be reporting a symptom, not the underlying problem. And our goal is to determine what that problem is and obviously what we're going to troubleshoot. Okay, so at this stage, we start to actually replicate the environment as well as the behavior. So yes, diagnostics bundles are great, but even project backups or gateway backups are better because that includes everything. Now, we also start looking here at error messages, wrapper logs, system logs, and memory dumps. And we use the logger levels to actually trace the errors back to their source. And this is the type of thing that you can find in a stack trace.
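In the spirit of "trust, but verify", here is a hedged sketch of checking a database connection from a script rather than taking "I checked three days ago" as an answer. The connection name and the Status column are assumptions; check them against your own gateway.

```python
# Hedged sketch: verify a database connection right now instead of trusting
# a three-day-old answer. Connection name and column name are examples.
info = system.db.getConnectionInfo("MyDatabase")
print("Connection status: %s" % info.getValueAt(0, "Status"))

# Going one step further: prove a round trip to the database works right now.
print("Round trip result: %s" % system.db.runScalarQuery("SELECT 1", "MyDatabase"))
```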
19:59
Speaker 1
All right, this is also where we ask you guys questions like, what are you trying to accomplish with the project? What is the expected behavior? What's happening that you don't like, what are you seeing on screen, how often is it happening? What are the steps that you guys took and that we need to take to generate the error? Because if we can replicate it, usually that means it's a bug. Now, this is just a screenshot of the system logs opened up in Kindling as well. This is very structured. You'll see on the side all of your loggers, and you can trace them and filter them accordingly, which is very cool. Right, now that we actually know what we need to look into, it's time to break down all of the elements.
20:45
Speaker 1
So usually this is where most of the people that I interact with tell me they have no idea what to do next. They’re like, Laura, here’s a screenshot of the error. What now? What does it mean? Why is it not working? And how do I fix it? And usually the answer is very straightforward. Usually it’s in one of those resources. So the focus of the isolation stage. There we go. Is to refine our understanding of the problem down to its core parts. So as we do this, we need to outline exactly what we have. This will help us understand the problem in detail, as well as develop questions that we might need to ask each other.
21:23
Speaker 2
Right.
21:23
Speaker 1
So I do not want anyone to ever hesitate to ask why. Why is this component not working? Why is my OPC connection switching between statuses? Why is this happening? Like, the more you understand what's going on, the better the overall resolution will be. Now, there's no expectation that any of us actually has all the answers, and that's why we have to help each other out, right? So once we have isolated the issue down to its core elements, we can begin to address if it's maybe just a configuration issue, if it's like expected behavior, or if it's an actual bug. So it's also possible at this stage that we proved ourselves wrong.
22:06
Speaker 2
Right.
22:07
Speaker 1
And that happens more than you guys think, that we have to go back to stage one or two, and we've got to go through those stages again, provided with new types of information. Now, the goal is to determine in what subsystem the core issue lies. All right, another screenshot. So in this stage, we take what we've learned from discovery and identification to decide what type of tools we're going to use to troubleshoot further. So here is where we actually move over to thread dumps. Now, they are exported as JSON files. You guys can actually open them up in Notepad++ or just in normal Notepad, but going through thousands of lines of code, I showed you guys what a thread dump looks like, right? One thread alone consists of, like, 20 lines, and that's not a very effective way to trace an event.
22:59
Speaker 2
Right.
23:00
Speaker 1
So that is also why we like to use Kindling. It's very cool in how you can actually track an event. So the tools that we need to use are based on our understanding of the underlying system that we are working on. So discovery is key to effective identification, and effective identification is key to isolation. Sounds like I'm preaching. That's horrible. But isolation is pretty much this part where you have to ask yourself the question, why is this thing breaking? And why is it triggering this behavior? Now, again, that's pretty much what it looks like in Kindling. So sorry to break it to you guys, but sometimes Ignition isn't always the issue.
23:45
Speaker 2
Okay.
23:48
Speaker 1
We are nice enough to say, it’s okay, we’re going to help you. We’re going to find the actual issue, even though it’s not ignition.
23:56
Speaker 2
Right.
23:56
Speaker 1
And it happens a lot. So I'm going to give you an example. Think of an OPC connection. The status keeps switching between reconnecting and faulted, and you have absolutely no idea why. And the only type of errors that we see in the logs is just that the OPC device is refusing the connection from Ignition. And again, you think it's Ignition.
24:17
Speaker 2
Right.
24:17
Speaker 1
So for us to be able to see if the issue is in Ignition or external to Ignition, obviously we use the built-in tools to see if it's Ignition. If it's not, we need to rely on external tools as well. So when we suspect that external factors are to blame, we start looking at what we call wiresharks. Okay, who of you guys knows what wiresharks are? This might be like a holy grail moment, or you might be thinking, what am I doing here? Like, this isn't software engineering. This is supposed to be troubleshooting. Okay, don't freak out. Wireshark is very helpful.
24:57
Speaker 2
Right?
24:57
Speaker 1
So Wireshark, that is the program. It's a free and open source packet analyzer, right? So it's used for troubleshooting, analysis, and software and communication protocol development. So wiresharks, plural, those are just the captures of the packets from the network connection, such as from your computer to your office, to the Internet, to the OPC device, et cetera. And the packet that we're capturing is a discrete unit of data in a typical Ethernet network. So what do we use wiresharks for? First of all, this is what they look like in Wireshark itself. And this information helps us analyze the communication between the gateway and any type of external connections. So it gives us a live view of arriving packets. And it also has a protocol analyzer, which is pretty cool.
25:56
Speaker 1
It identifies the protocol that’s being used by dereferencing the port number in the packet header. Very technical. I know this is not a user friendly tool. It’s not a beginner tool. If you want to use wiresharks effectively, or Wireshark effectively, you need to know exactly how a network operates, meaning you need to understand things like three way TCP handshakes and various protocols like TCP, UDP, DHCP, and ICMP, those type of protocols.
26:28
Speaker 2
Right.
26:29
Speaker 1
But luckily that's not your work. That's our job. But if you do want to start dabbling with this, Wireshark is free to download. You can check it out yourself. You can capture a few packets yourself and start analyzing them. There's a lot of tutorials and videos online on how to do it, and the probability of me asking you guys for a Wireshark export is going to be high. So you might as well just download it and start playing around with it. Okay.
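If you want to hand support a capture without opening the Wireshark GUI, a rough sketch using tshark (Wireshark's command-line tool) from Python could look like this. The interface name, the OPC UA port 4840, and the two-minute duration are assumptions; adjust them to your network.

```python
# Hedged sketch: record a short capture with tshark for a support ticket.
# Interface name, port, and duration are examples -- adjust for your setup.
import subprocess

subprocess.check_call([
    "tshark",
    "-i", "Ethernet",            # capture interface
    "-f", "tcp port 4840",       # capture filter: default OPC UA port
    "-a", "duration:120",        # stop automatically after two minutes
    "-w", "opc-capture.pcapng",  # file you can attach to the email
])
```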
26:58
Speaker 2
Right.
26:58
Speaker 1
So the last stage is resolution. So my question is, how do you determine that a problem is solved? Now, at this stage, we need to find a clear and understandable answer to our problem. If we can't clearly explain why it happened, then the issue is not resolved and it's probably going to happen again. So the primary objective of troubleshooting is not just to explain why something happened, but to be able to find the root cause and stop it from happening again. And this is where we might actually be resolving the issue, or we might be escalating it as a bug that we've identified and isolated to Inductive Automation or Flow or Canary, and also where we would kind of just give you, like, a workaround, if that is the case.
27:50
Speaker 1
Now, the last thing that we need to do, yes, I know it's not part of the troubleshooting stages, but this is like an overall effect on those four, is to define a baseline analysis. Now, the troubleshooting stages, it's an iterative process. Okay. We need to understand the problem fully. This analysis is achieved when we revisit every single one of those stages and we evaluate the information to refine the problem definition. So we review and assess whether the most basic replication steps have been followed. And obviously, if we can agree upon a resolution. If, like, software support engineers, SIs and clients cannot agree upon a resolution for an issue, then we need to go back. We need to start at the beginning. Probably. Hopefully not, but we need to go back. We need to revisit our steps. So, troubleshooting.
28:46
Speaker 1
Yeah, it is a cyclical process. You’ll occasionally miss things, for example, in your discovery stage, that may lead you down a path when you get to, let’s say, identification or isolation, and it’s going to lead you nowhere. You will occasionally misidentify a symptom as the core issue, and you will maybe isolate the wrong subsystem that the issue is coming from. So it’s important to be very resilient to these steps. Everyone is going to make assumptions and they can be a necessary evil to save time.
29:19
Speaker 2
Right.
29:20
Speaker 1
Overall, what I just want to close off with is that troubleshooting is all about being able to step back after you’ve made those assumptions and they haven’t panned out and be able to start again. Anyway, thank you for attending my TED talk.
29:41
Speaker 2
Right.
29:42
Speaker 1
I will not take any questions. I’m going to enjoy a cold one, so cheers. No, I’m kidding. I’ve got two minutes left. I will take questions. Nothing?
29:54
Speaker 2
Great.
29:54
Speaker 1
Damn it. Yes.
29:58
Speaker 3
Is it possible to get. So when you write to the gateway logs from any script you've got, is it possible to see that in your designer or through a session, your gateway scripts? Yeah.
30:11
Speaker 1
So those gateway scripts are accessible in.
30:13
Speaker 3
The gateway themselves, the gateway logs that you can see through the gateway.
30:22
Speaker 1
Can you see the gateway scripts in your gateway logs?
30:25
Speaker 3
No, sorry. If you're writing to the logs, like getLogger, can you see the logs, the results, in your logs or in your designer?
30:38
Speaker 1
Yes, in your output console.
30:40
Speaker 3
Okay.
30:40
Speaker 1
Yeah, cool. Yeah. So if you write your own loggers, you’ll find them in the designer, in your output console.
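As a quick, hedged illustration of the answer above: a script logger goes to the gateway and designer logs, while print goes to the designer's output console. The logger name and messages are just examples.

```python
# Hedged sketch: where script output ends up.
logger = system.util.getLogger("MyProject.Scripts")  # example logger name
logger.info("Recipe download finished")   # shows up in the gateway/designer logs

print("Recipe download finished")          # shows up in the designer output console
```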
30:45
Speaker 3
All right. I’ll have a look into that. Thank you.
30:46
Speaker 2
Cool.
30:47
Speaker 1
Yeah. They'll also be logged to the log files themselves, on the gateway and in the designer. Any other questions? I'm glad that everyone knows everything because this is great. Yes. Identify functions that cause memory leaks. So the biggest thing that I've seen is people that don't add any type of return if their function expects returns. So that's like a big memory leak issue that I've seen people do, especially guys that don't have experience in coding, unfortunately. So there's not really an easy way to identify that, because in the system functions you can see the type of functions that expect returns. But if you write your own, then that's something that you've got to make a note about. You've got to remind yourself. No, when we do memory leaks, we dive deeper into memory dumps. No, they won't be logged to external logs. Unfortunately not.
31:57
Speaker 1
And that usually takes up most of the time. The longest memory leak that I've worked with took four months, and it came down to someone missing something in a script. Yeah. So that's fun. Any other questions? Nothing? Cool. And my time is up. Done. I will see you guys on the other side. Close.