By Clarise Rautenbach
20 November 2023

Flow Diagnostics and Troubleshooting Basics


SPEAKERS:

Tebello Masedi
Customer Success Engineer
Element8

Introduction

The diagnostics and troubleshooting basics for Ignition were aimed at the more experienced developers and users. Tebello Masedi from Element8 provided an overview of how our support team uses supporting information during Flow diagnostics and troubleshooting.

Transcript

00:00
I am part of the customer success team, so I work with Laura hand in hand. My specialty is Flow and Canary, while Laura mostly works with Ignition and Canary. So we share Canary and then we have our specialties. But if you do need help with Ignition, I’m always available as well. Okay, I am going to be covering diagnostics and troubleshooting basics with Flow. Luckily, Laura has had a chance to do a systematic explanation of how you do troubleshooting, which is applicable to any other product, regardless of what we are presenting in front of you. Right. Okay, so earlier this morning we had a presentation with Lenny where he explained what Flow does. And hopefully, depending on whether you’re an SI or you are an end user, you have an idea of where it’s applicable.


00:53

If you don’t already have it. And if you do have it, now you know what else you can do with it. Right? I’m going to show you how to diagnose issues within Flow, just in case you run into some of them. It happens. There’s no such thing as perfect software that runs indefinitely. And usually it will disappoint you when you’re busy with something. It never just stops when you’re not busy with something. Right? There are multiple diagnostic tool sets available within Flow. Some of them are part of the Flow product itself, but the others can be external to Flow. Right. Starting with loggers. Usually these are part of your operating system and have nothing to do specifically with Flow. Right. We have the Windows Event Viewer. Flow logs its events to the Windows Event Viewer.


01:44

We have the bootstrap service, which is the base service that runs Flow. The bootstrap service is the service that has to run before everything else in Flow starts, and you can execute it independently of the product so that you can investigate what’s happening if all else fails. Usually this is when there’s a fatal error and then you have to go that far. And if you are running Flow on Linux or on Docker, then you would use the Linux journal. We also have the Flow configuration tool. This is the actual management tool that you’d be using to configure your Flow environment. And you would be logged into a specific Flow instance, right?


02:27

You can have multiple of them, but with the config tool you’d be logging in to configure a specific one with your connections and your dashboards. In there you will find the activity viewer. I will explain what these are later on. You’ll find the monitor tool, you’ll find change logs, and the server statistics. And lastly, we have third party tools that are useful. Like I said, not specific to Flow, but when working with Flow, usually we’ll have to touch on them, depending on what specific product you’re using. Okay, so going into detail, we are going to start with the Windows Event Viewer. Usually this is the first thing I run to when someone says they have a problem. I want to see your event logs. They will tell you what’s happening with a specific component in Flow.


03:19

Luckily, the way Flow works is that each major component is grouped, or is modular. So if you’re working with connections, you are looking at the data engine, which would be, if I zoom in here, want to show you? Okay, no, that’s fine. But you notice over here that it says elevate. And then there’s data engine. The writing is a bit smashed, but you get the gist. There’s Flow engine, there is message engine, Flow platform. So each component is separated. So when an error comes, you can see or identify where the heck it comes from, right. That way it makes it easier for you to say, okay, my Flow server is not working, so I’m only going to focus on the Flow server logs instead of having to look at the overall thing. Right. It makes it easier for you to identify the issue.
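As a rough sketch of doing this first check from a script rather than the Event Viewer window, the snippet below builds a `wevtutil` query for the most recent critical and error events in a single log. The log name is a placeholder assumption, not a confirmed Flow log name; the real names should be copied from Event Viewer.

```python
import subprocess
import sys


def build_wevtutil_query(log_name: str, max_events: int = 20) -> list[str]:
    """Build a wevtutil command that lists recent critical/error events."""
    return [
        "wevtutil", "qe", log_name,            # qe = query events from one log
        f"/c:{max_events}",                    # limit the number of events
        "/rd:true",                            # newest events first
        "/f:text",                             # plain-text output
        "/q:*[System[(Level=1 or Level=2)]]",  # Level 1 = Critical, 2 = Error
    ]


def recent_errors(log_name: str, max_events: int = 20) -> str:
    """Run the query and return its text output (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("wevtutil is only available on Windows")
    result = subprocess.run(
        build_wevtutil_query(log_name, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# "Example-Component-Log" is a hypothetical name used for illustration only.
print(" ".join(build_wevtutil_query("Example-Component-Log", 5)))
```

Because each component logs separately, querying one log at a time keeps the output focused on the component under suspicion.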


04:14

These will be errors that have an ID as well. So if you have a specific measure that is causing the issue, but the Flow configuration tool doesn’t tell you anything about that, the Flow event logs are where you will find out the measure that’s causing the issue, by ID. You can then come and search for it and then adjust what needs to be adjusted. Right. Common uses: configuration errors and fault analysis. So if there is something that’s wrong, you go and check that out, and if the issue has to do with the configuration, it will usually be logged in the event logs. Okay. We also have data source and consumer issues, consumers being whatever third party systems you’re integrating data to, and data sources being where you’re getting your data, like a Canary or an Ignition.


05:10

This is part of the Windows administrative tools, so you’ll find it as a native part of Windows. We also have the activity monitor. This is within Flow. It’s similar to the event logs, but it’s more detailed, and it’s specific to what the engine, the Flow component that is actually processing data, is doing at the current moment. So you have a measure that’s not updating, a KPI that’s not updating, right? And you want to figure out why the heck it’s not updating. If you use the activity monitor tool, this is where you will find out that the Flow engine has actually started processing and looking for data.


05:49

If you look at the first line there, it says prepare measures, which is the phase in which Flow will collect data in preparation to actually do the calculation. If you go further down, you’ll see it says measure limit time or time scheme. That’s when, if you’ve set up limits, it’s checking whether the limit is applicable and doing that logic. So this gets to that level of detail, right? And what you’re looking for is in the status column, where you’re seeing current and start and end and processing. What you don’t want to see is quarantined or faulted. We will look at that later when we have a demo, and that will tell you whether or not you have an issue with your processing itself. Makes sense?
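The "look only for quarantined or faulted" rule can be illustrated with a small filter. This is a minimal sketch: the row structure, field names, and sample values are assumptions made for illustration and are not taken from the Flow activity monitor itself.

```python
from dataclasses import dataclass

# Statuses that signal a processing problem, per the talk.
PROBLEM_STATUSES = {"quarantined", "faulted"}


@dataclass
class ActivityRow:
    measure_id: int   # hypothetical field: ID of the measure being processed
    phase: str        # hypothetical field: processing phase shown for the row
    status: str       # hypothetical field: value of the status column


def problem_rows(rows):
    """Return only the rows whose status is quarantined or faulted."""
    return [r for r in rows if r.status.strip().lower() in PROBLEM_STATUSES]


# Made-up sample rows for illustration.
rows = [
    ActivityRow(101, "Prepare Measures", "Processing"),
    ActivityRow(102, "Prepare Measures", "Quarantined"),
    ActivityRow(103, "Measure Limit", "Faulted"),
]

for row in problem_rows(rows):
    print(row.measure_id, row.status)
```

Everything else in the status column (current, start, end, processing) is normal activity and can be ignored during an initial investigation.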


06:35

So in here, the first thing you want to do as a person who is just doing an initial investigation is just look for faults and quarantines. That will tell me what else we need to do, right. We also have the monitor tool. Here, your measures or KPIs that are in the queue for processing are listed by retrieval type and interval. So each column, each bar there will tell you this is a monthly KPI, this is an hourly KPI. And each stack will actually tell you whether it is received from a data source, is calculated, or is a manually entered measure. That way, because of the size of the stack, you can tell which one has the most pending intervals, right? So if it’s an hourly measure, that number would mean 280 hours are pending to be processed. Makes sense?
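The arithmetic behind reading a stack in the monitor tool can be sketched as follows: a count of pending intervals multiplied by the interval length gives the backlog. The interval names and the function are illustrative assumptions, not part of Flow.

```python
from datetime import timedelta

# Fixed-length intervals only; monthly is left out because months vary in length.
INTERVAL_LENGTH = {
    "minutely": timedelta(minutes=1),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
}


def backlog(interval: str, pending: int) -> timedelta:
    """Convert a count of pending intervals into the time span still to process."""
    return INTERVAL_LENGTH[interval] * pending


# The example from the talk: 280 pending intervals on an hourly measure.
print(backlog("hourly", 280))  # 280 hours, i.e. 11 days and 16 hours
```

A large backlog on one interval type is a hint about where processing stopped, not a diagnosis in itself.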


07:30

And then lastly we have the SQL management tools, or Azure Data Studio if you use Linux. I went full force with Linux, so I’ve switched my OS as well. So I’m coming to realize that most of the things I’m used to in Windows are not there in Linux, and I’m working with what I have. But at least now you know that someone can support you, should you walk into a customer that actually uses Linux. Right. Here you will usually only come when you’ve logged a ticket. We’ve tried everything and now we need to make adjustments to the Flow database, or we are asking for diagnostic information, or maybe making changes to your indexes, right? And we will give you a query to execute and then you come back to us with that.


08:19

If you are more savvy with SQL, then you can go ahead and do queries against the database to see what’s happening. But we do not encourage making changes to the database or creating extra tables elsewhere, as that’s not covered in the EULA, because it may end up breaking it. And then you have to do very hectic changes to actually recover what you have. First rule, it’s not documented here, but the first rule of diagnosing something in Flow: I always believe in making a backup before you proceed. I live by that rule. Okay, before I bore you with all the slides, I’m going to take you to a demo, and what I’ve prepared is a simulated problem. These problems that I’m going to show you are some of the common issues that we run across or come across when we use Flow, right?


09:08

And they’re usually the critical ones that end up with data source processing, where your data doesn’t come in and you just don’t know where the issue is, or Flow itself is not starting and you’re stuck and you don’t know what the heck is happening. Right. So I’m going to quickly switch to my VM. Might just need to mirror my screen. There we go. And there. Okay, cool. So what we have here is the Flow configuration tool. And this is where I manage my Flow instance. I don’t know how many of you guys have used Flow before, so my favorite customer is the one, if you saw her, you didn’t. If you saw her, you have to make sure that you become my next favorite. I’m tired of getting calls about this problem.


10:06

I want to be called once in a while to just say, hi, everything’s working, everything is good. It’s always a problem when you guys call me. I’m kidding. Okay, so what we have here is the Flow config tool with my KPIs listed over here. So far, nothing looks to be wrong. Right? We don’t have any errors running here. Right. I go to my deployment and it turns out there is something wrong. You do come across some of these issues where the model says there’s nothing wrong, but when you go into your deployment, something is wrong. Right? What do you do next? First and foremost, we discover: we’ve just discovered this issue. Now we’re going to identify what’s happening. The Flow configuration tool will tell you with these indicators if there’s something wrong.


10:57

In this case, you can see red means danger, universal language, and orange just means it is undeployed. It is not currently working or is in between states. This is a new feature, so users who are used to older versions of Flow will not see the orange indicator. Okay, so I am going to hover over the components. So this is my Flow platform, which is the main thing. It will be named after the machine where Flow is installed. You can have multiple of them depending on how your architecture for Flow is laid out. Right. In this case, we have everything all in one box. So I am going to hover over that.


11:33

And this tells me that there might be an issue with my hostname or the port number, or I have applied security but then I didn’t do the required steps to make sure that it’s applied correctly. Right. That’s a suggestion. Usually Flow will give you a hint to what’s happening right under there, on the components. The way errors work within the config tool is we have parent-child relationships between errors and components. Right. So when one of my children has an issue, the parent will know and the siblings will know as well. Okay, so if I go to the Flow server, it will tell me that my platform is not running. So it means that the highest priority here is to get the platform running. So how do you see if something is running? You check if it’s deployed or not.
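The idea of fixing the highest component in the hierarchy first can be sketched as a sort by depth in a parent-child tree. The component names and the tree layout below are assumptions for illustration; an actual deployment view may be laid out differently.

```python
# Hypothetical hierarchy: each component maps to its parent (None = top level).
PARENT = {
    "platform": None,
    "server": "platform",
    "engine": "platform",
    "message engine": "platform",
}


def depth(component: str) -> int:
    """Number of ancestors above a component in the hierarchy."""
    levels = 0
    while PARENT[component] is not None:
        component = PARENT[component]
        levels += 1
    return levels


def fix_order(in_error: set[str]) -> list[str]:
    """Order failing components so parents are handled before their children."""
    return sorted(in_error, key=lambda c: (depth(c), c))


print(fix_order({"engine", "platform", "server"}))
# ['platform', 'engine', 'server']
```

Child errors often clear on their own once the parent is running again, which is why the parent goes first.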


12:18

We apply the concept of deployment within Flow. Right. It’s deployed, which means you left it running in the correct state, but still it’s an error. What’s next? Let’s go back to the Windows Event Viewer and find out. Right, what does it say in the event viewer? I am going to go to my Windows Event Viewer over there. You will find it in the Windows start menu. I just put it here for your own convenience. And what we’re looking for is the Applications and Services Logs. And we’re going to go to Flow. And here are the Flow logs. And you can see each component is listed here with the name of the instance, which is the name of the project I had configured. Right.


13:02

Okay, so the latest one, let me just refresh, make sure that I am seeing the correct things. Yeah, the latest one said the bootstrap was stopped, but we just saw that the bootstrap says it’s running in there. Something must have stopped it while it was in the running state. It had a critical failure. Right. Okay, let’s see if we can find anything else. Go to the engine. The engine will complain about something else. But based on the hierarchy that we saw on the deployment view, we know the bootstrap is the highest. The bootstrap and the platforms are the highest components, or are the most important ones. So you fix it in that order. Right. So in this case, the bootstrap tells us it stopped, and then what do I do next? The logs don’t have anything else.


13:47

So this is where we now go to the bootstrap exe file, where we go to see what the bootstrap says as an executable. So what the bootstrap service does is it will run in a console. So obviously the service is running in the background and then there are messages that appear as it runs. Right. So we want to see it as it’s running, because at this point it has stopped. But if we start it again, we can’t start it. Let’s quickly try to start it, just so that you know I’m not lying to you guys. So I’m going to do that. So I would have called you and asked you, have you tried undeploying and deploying it again? And you tell me yes, and then I’m shocked because it doesn’t want to deploy again. So it tells me it’s not deploying.


14:28

It’s the same message we saw in the tooltip. Right. Okay, so I’m going to click OK, go back to my. Now we’re going to go to the bootstrap itself. We’re going to end up in the command prompt, but instead of confusing you guys, I’m going to take you to the location using the Windows File Explorer. So in here we want to go to Program Files and we want to go to Flow Software, Flow Bootstrap. And this is where the actual Flow bootstrap is, with its files. Right. And you want to hold Shift and quickly open it in PowerShell. This is just the option that this is giving me. But what you want to open it in is the command prompt and not PowerShell. So to do that, just type in cmd.


15:23

This is just a tip, it’s nothing special. So now it’s opened in the command prompt. I am going to look for Flow and just press Tab a few times to open the actual Flow bootstrap. Right. And now the bootstrap is running in the background. It will give me a lot of messages. You’ll notice the messages are not in simple English, unfortunately. This is because Flow is a .NET application and the errors are logged as is, coming in from the component. So if I wait a few moments, I will start seeing what the bootstrap is doing. So if you actually read these, you will notice that they describe what procedure they are on at the moment. So if I go back to the config tool and I’ve undeployed this guy, I’ll just make sure that the subcomponents are undeployed as well. Okay.


16:15

We’re doing good for time. Those things in the last couple today. Yes, thank you very much. You see, we’re working together here. Teamwork. Okay, there we go. I’m going to undeploy this guy and I’m going to now attempt a start in here. Right, so the bootstrap is running. We can see what it’s doing. Now I need to give it a task to do. So I’m going to go ahead and click on deploy. Click yes. Then go back to this guy. You will notice that Flow has a bunch of threads coming in and one of them is returning a 404. This is the same 404 you see when you can’t reach a website and it says 404 not found. It means exactly the same thing.


17:02

What you’re seeing here, this ID, where it says instances and the long ID, this is the actual name of the specific instance you have. This is the internal identifier for that, because you can have a system where you have multiple Flow instances running on it. Right. So this is a special ID. So what does the 404 mean? Usually the 404 error will tell you that your password is not correct. Why is that the case? The way Flow works is that when you run the Flow bootstrap service, it has to be with a local administrator account. I’ve seen customers who know that it has to be an administrator account, but because the customer itself runs on AD, they use a domain administrator account.


17:49

And for some reason the domain administrator account doesn’t allow the Flow bootstrap to create the correct files inside of Flow to actually go ahead and build what’s needed. Right. So it results in a 404. Because Flow says, okay, I’ve attempted to create those files, but then I can’t find where the files are. Makes sense, right? Okay, so now that we have those running, let’s also go check the event viewer. You will see now that we have the event viewer running again. Luckily, the Windows Event Viewer gives you messages that are a bit more readable than what we just saw. Okay, so now we will see that if I click on the Flow engine, I continue getting the same messages I had. But I am looking for the bootstrap.


18:34

You will see when I started it, somewhere here, and you will see the subcomponents starting as well. So you will see the bootstrap was the first one to start and the subcomponents will follow. We’re not done yet. We’re now getting to a point where we are trying to figure out what’s happening. I said the 404 meant that there was a password issue. Right. So let’s go change it in the Windows services, on the bootstrap service itself. So we go to Flow and we are going to look for the bootstrap service and go to properties, and make sure that this password is correct. Also, whenever you create a local administrator account, make sure you do not apply password policies or an expiring account, as that forces you to do this every month, whenever it expires. Okay, so let me put in my super secure password.


19:32

One, two, three. And I want to start the service, and the last thing I want to do is come back to the Flow config tool and just make sure that all is good, so that error is gone. Agreed? Okay, now we can start the other components and check if our system is running. Right. So usually once I give you guys a solution like this, I’m also guilty of letting you just walk away without checking the other stuff. So we’re going to check it together so that we remember we did it together. Right? Okay, so now we have our components running. I’m just going to start the ones that we need to look at today. So you’ll see it will be amber, switches to red, and then it disappears when it’s finished processing. Okay, great. This will come along as well.


20:33

And let’s quickly check if our dashboards are updating, because your users will always see the dashboards; they don’t ever use the config tool, for the most part. So their interface with Flow would be the reports. Right. So let’s go to the report. I have one of the dashboards here. Excuse me, that wasn’t what I meant to open. I want to open that guy and go to charts and engineering. I have a nice dashboard I’ve prepared here. We want to see it open. Opening means the bootstrap started and all the Flow server subcomponents started. This tells me something is either wrong or it’s not finished opening. So I’m going to refresh and check if it comes back. If not, I’m going to go back to the event logs to just make sure that the Flow server is happy.


21:26

So it just might be that we need a moment, which I think is the case. Where are we? Oh, okay, sorry. We just need to undeploy the platform after doing the bootstrap restart, because it was in the deployed state. We just need to undeploy it and redeploy it again. Okay. Don’t make me a liar. Where is that? You know what? I think I might have misspelt my. There we go. I was impatient. It happens, guys. It happens. Let’s see. Impatience. Okay, all is running here. I might need to just close this guy. Okay. Can’t have two bootstraps running side by side. Okay. And I am going to try that one more time. Start. Restart. Sorry. Yeah, it was a tiny thing. You can’t have both of them running. So now it’s been running on that other one.


23:07

And then that’s why we are seeing errors. Okay, I am going to undeploy my platform. Redeploy it. And with that we should be able to see our Flow server up and running without issue, and a historian. Just undeploy and deploy that. Do a quick refresh, wait for this guy to disappear. Do we have anything in the logs? Refresh. Yeah, we’re good. So what else is happening? You just might need to, what’s the word, repeat the process again, go back to discover. I think all is good here. This was when both of them were running, so I’m happy with this error. We’ve resolved it. If we refresh. It’s starting everything. Or it might be. Yeah, I’m just impatient. Okay, there. So everything is running. Refresh. There we go. See? You get a sock. Okay, now we should have our dashboard over here.


24:54

If you’ve attended training before, you would be very familiar with how we built this. Okay. All is good in here. And now your users would be able to see that the dashboards are here. One last thing that we need to do is make sure that the data is updating now that you’re seeing the dashboard, because it’s a different component, the visualization is different from the actual data processing. We need to make sure that the data is actually processing. Right. So now let’s go back to the flow config tool. And we want to go to use the monitor tool. Right. You remember I mentioned that it stacks the data according to type and interval. I’m just going to make space for you guys here by removing that guy. I think we’re done with it and we are going to go to monitor.


25:38

And now you will see there are some measures that need to be processed in here. I’m going to set the update to every second so I can see them going and processing. The lower the number of items here, the better. It just means that you are in the current time, and whenever there’s processing, this will go up. But then if everything is okay, it goes back down. So if you have hourly updating measures, do these update on the hour? No, by default we check for data 60 seconds after the hour passes. So you will see the numbers go up at the beginning of the hour and then drop. For dailies, the same thing: it goes up and then drops. Okay, in this case, let’s see. It looks like we have a problem. We have measures that have been quarantined because they have errors. Right.
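The default timing mentioned above, checking for data 60 seconds after the hour passes, can be sketched as a small calculation. The function name and the configurable delay parameter are assumptions for illustration; only the 60-second default comes from the talk.

```python
from datetime import datetime, timedelta


def next_hourly_check(now: datetime,
                      delay: timedelta = timedelta(seconds=60)) -> datetime:
    """Return the next time an hourly measure would be checked for data."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    candidate = top_of_hour + delay
    # If this hour's check time has already passed, move to the next hour.
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


print(next_hourly_check(datetime(2023, 11, 20, 10, 30, 0)))
# 2023-11-20 11:01:00
```

This is why the pending count in the monitor tool rises shortly after the top of the hour and then falls again once processing completes.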


26:24

You remember in the activity monitor I said we are only looking for quarantined or faulted. That’s the biggest thing. We just saw a bunch of other stacks go down. They’ve even disappeared here. Right. So there must be something else that has an issue in here. So if I go to hourly quarantined, it will tell me what measure it is, and this small number over here is the ID for that measure, which means I can go and put in that ID in here. And that should take me, well, I have to go to the model first, where my KPIs are. That will take me to this measure. This is an hourly updated production measure and I want to check out what’s happening. So whatever is going on with this measure is causing it not to update.


27:15

It hasn’t updated in three weeks, it looks like. And now you’re wondering what the heck is happening because there’s no error here. And this is calculated or retrieved from a data source. So there’s nothing fancy we need to configure here. Right. So the next step would be to figure out where the issue is coming from. So we go back to the identify stage, basically. So I am going to say let’s go up through the hierarchy where it gets the data. For this example, we got the data from a tag. It says that this guy is a tag. So let’s right click and show in model. So this guy would be the next part to check. It doesn’t have data. So I’m going to go to the highest point in my model, and in here I can see my actual data source.


28:07

So the tag that I’m getting from my historian is the issue, because it has nothing in here. So what I’ll do is quickly adjust this and refresh. So in this case it was a syntax issue. That’s why I knew, obviously. But what we want to do is to make sure, if you’re using placeholders, that you don’t have any syntax issues. I’m going to adjust that and I am going to just redeploy, and then we should be good to go. Refresh, undeploy and deploy, and we will get our values. Luckily, once those come in, we should be good to go and the data will come through. Okay, let’s see. I have run out of time, but okay, that should be fine. That should be fine after this finishes deploying.


29:24

But basically what you always want to do is make sure that you use the monitor tools to look at the affected component, and then from there you go and follow the logs and follow the trail. In that way you always go to the source measure, or the source of the error, or the parent in the hierarchy, and then go solve the problem in there. With that being said, there we go. So it’s updating and then we should see the measures updating as well. I did include... it’s finished. We don’t have anything here. Then it means that your customers will get their updates and KPIs updated as needed. Okay. I did include slides on the other items, sorry, on the other tools that you can use. The slides will be shared with everyone, so don’t worry about that.


30:17

It’s just what the tool is and where to use it. But otherwise. Thank you so much for joining me. One question. What’s the first thing to check when you have an error? I have socks. No, you got something? Someone else. What’s the first thing to check when you have an error? In flow driver? Perform a backup. Not quite. That’s the first thing to do, but what’s the first thing to check? The event logs. Yes. Thank you. Thank you guys. I appreciate your time. Okay.
