docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.TailFile/additionalDetails.html - nifi-site - Git at Google

 <!DOCTYPE html>
 <html lang="en">
     <!--
       Licensed to the Apache Software Foundation (ASF) under one or more
       contributor license agreements.  See the NOTICE file distributed with
       this work for additional information regarding copyright ownership.
       The ASF licenses this file to You under the Apache License, Version 2.0
       (the "License"); you may not use this file except in compliance with
       the License.  You may obtain a copy of the License at
           http://www.apache.org/licenses/LICENSE-2.0
       Unless required by applicable law or agreed to in writing, software
       distributed under the License is distributed on an "AS IS" BASIS,
       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
       See the License for the specific language governing permissions and
       limitations under the License.
     -->
     <head>
         <meta charset="utf-8" />
         <title>TailFile</title>

         <link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" />
     </head>

     <body>

 		<h3>Introduction</h3>
 		<p>
 			This processor offers a powerful capability, allowing the user to periodically look at a file that is actively being written to by another process.
 			When the file changes, the new lines are ingested. This Processor assumes that data in the file is textual.
 		</p>
 		<p>
 			Tailing a file from a filesystem is a seemingly simple but notoriously difficult task. This is because we are periodically checking the contents
 			of a file that is being written to. The file may be constantly changing, or it may rarely change. The file may be "rolled over" (i.e., renamed)
 			and it's important that even after restarting the application (NiFi, in this case), we are able to pick up where we left off. Other additional complexities
 			also come into play. For example, NFS mounted drives may indicate that data is readable but then return NUL bytes (Unicode 0) when attempting to read, as
 			the actual bytes are not yet known (see the &lt;Reread when NUL encountered&gt; property), and file systems have different timestamp granularities.
 		</p>
 		<p>
 			This Processor is designed to handle all of these different cases. This can lead to slightly more complex configuration, but this document should provide
 			you with all you need to get started!
 		</p>


 		<h3>Modes</h3>
         <p>
             This processor is used to tail a file or multiple files, depending on the configured mode. The
             mode to choose depends of the logging pattern followed by the file(s) to tail. In any case, if there
             is a rolling pattern, the rolling files must be plain text files (compression is not supported at
             the moment).
         </p>

         <ul>
         	<li><b>Single file</b>: the processor will tail the file with the path given in 'File(s) to tail' property.</li>
         	<li><b>Multiple files</b>: the processor will look for files into the 'Base directory'. It will look for file recursively
         	according to the 'Recursive lookup' property and will tail all the files matching the regular expression
         	provided in the 'File(s) to tail' property.</li>
         </ul>

         <h3>Rolling filename pattern</h3>
         <p>
         	In case the 'Rolling filename pattern' property is used, when the processor detects that the file to tail has rolled over, the
         	processor will look for possible missing messages in the rolled file. To do so, the processor will use the pattern to find the
         	rolling files in the same directory as the file to tail.
         </p>
         <p>
         	In order to keep this property available in the 'Multiple files' mode when multiples files to tail are in the same directory,
         	it is possible to use the ${filename} tag to reference the name (without extension) of the file to tail. For example, if we have:
         </p>
        	<p>
         	<code>
 	        	/my/path/directory/my-app.log.1<br />
 	        	/my/path/directory/my-app.log<br />
 	        	/my/path/directory/application.log.1<br />
 	        	/my/path/directory/application.log
         	</code>
        	</p>
        	<p>
        		the 'rolling filename pattern' would be <i>${filename}.log.*</i>.
        	</p>
         <h3>Descriptions for different modes and strategies</h3>
         <p>
         	The '<b>Single file</b>' mode assumes that the file to tail has always the same name even if there is a rolling pattern.
         	Example:
         </p>
        	<p>
         	<code>
 	        	/my/path/directory/my-app.log.2<br />
 	        	/my/path/directory/my-app.log.1<br />
 	        	/my/path/directory/my-app.log
         	</code>
        	</p>
        	<p>
         	and new log messages are always appended in my-app.log file.
         </p>
         <p>
         	In case recursivity is set to 'true'. The regular expression for the files to tail must embrace the possible intermediate directories
         	between the base directory and the files to tail. Example:
         </p>
         <p>
         	<code>
 	        	/my/path/directory1/my-app1.log<br />
 	        	/my/path/directory2/my-app2.log<br />
 	        	/my/path/directory3/my-app3.log
         	</code>
        	</p>
         <p>
         	<code>
 	        	Base directory: /my/path<br />
 	        	Files to tail: directory[1-3]/my-app[1-3].log<br />
 	        	Recursivity: true
         	</code>
        	</p>

 		<p>
         	If the processor is configured with '<b>Multiple files</b>' mode, two additional properties are relevant:
         </p>

         <ul>
         	<li><b>Lookup frequency</b>: specifies the minimum duration the processor will wait before listing again the files to tail.</li>
         	<li><b>Maximum age</b>: specifies the necessary minimum duration to consider that no new messages will be appended in a file
         	regarding its last modification date. If the amount of time that has elapsed since the file was modified is larger than this
         	period of time, the file will not be tailed. For example, if a file was modified 24 hours ago and this property is set to 12 hours,
         	the file will not be tailed. But if this property is set to 36 hours, then the file will continue to be tailed.</li>
         </ul>

 		<p>
         	It is necessary to pay attention to 'Lookup frequency' and 'Maximum age' properties, as well as the frequency at which the processor is
         	triggered, in order to achieve high performance. It is recommended to keep 'Maximum age' > 'Lookup frequency' > processor scheduling
         	frequency to avoid missing data. It also recommended not to set 'Maximum Age' too low because if messages are appended in a file
         	after this file has been considered "too old", all the messages in the file may be read again, leading to data duplication.
         </p>

 		<p>
         	If the processor is configured with '<b>Multiple files</b>' mode, the 'Rolling
         	filename pattern' property must be specific enough to ensure that only the rolling files will be listed and not other currently tailed
         	files in the same directory (this can be achieved using ${filename} tag).
         </p>


 		<h3>Handling Multi-Line Messages</h3>
 		<p>
 			Most of the time, when we tail a file, we are happy to receive data periodically, however it was written to the file. There are scenarios, though,
 			where we may have data written in such a way that multiple lines need to be retained together. Take, for example, the following lines of text that
 			might be found in a log file:
 		</p>
 	<code>
 		<pre>
 2021-07-09 14:12:19,731 INFO [main] org.apache.nifi.NiFi Launching NiFi...
 2021-07-09 14:12:19,915 INFO [main] o.a.n.p.AbstractBootstrapPropertiesLoader Determined default application properties path to be '/Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties'
 2021-07-09 14:12:19,919 INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 199 properties from /Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties
 2021-07-09 14:12:19,925 WARN Line 1 of Log Message
 			Line 2: This is an important warning.
 			Line 3: Please do not ignore this warning.
 			Line 4: These lines of text make sense only in the context of the original message.
 2021-07-09 14:12:19,941 INFO [main] Final message in log file
 		</pre>
 	</code>

 		<p>
 			In this case, we may want to ensure that the log lines are not ingested in such a way that our multi-line log message is not broken up into Lines 1 and 2 in one FlowFile
 			and Lines 3 and 4 in another. To accomplish this, the Processor exposes the &lt;Line Start Pattern&gt; property. If we set this Property to a value of
 			<code>\d{4}-\d{2}-\d{2}</code>, then we are telling the Processor that each message should begin with 4 digits, followed by a dash, followed by 2 digits, a dash, and 2 digits.
 			I.e., we are telling it that each message begins with a timestamp in yyyy-MM-dd format. Because of this, even if the Processor runs and sees only Lines 1 and 2 of our
 			multiline log message, it will not ingest the data yet. It will wait until it sees the next message, which starts with a timestamp.
 		</p>
 		<p>
 			Note that, because of this, the last message that the Processor will encounter in the above situation is the "Final message in log file" line. At this point, the Processor does
 			not know whether the next line of text it encounters will be part of this line or a new message. As such, it will not ingest this data. It will wait until either another message
 			is encountered (that matches our regex) or until the file is rolled over (renamed). Because of this, there may be some delay in ingesting the last message in the file, if the process
 			that writes to the file just stops writing at this point.
 		</p>

 		<p>
 			Additionally, we run the chance of the Regular Expression not matching the data in the file. This could result in buffering all of the file's content, which could cause NiFi
 			to run out of memory. To avoid this, the &lt;Max Buffer Size&gt; property limits the amount of data that can be buffered. If this amount of data is buffered, it will be flushed
 			to the FlowFile, even if another message hasn't been encountered.
 		</p>

     </body>
 </html>
	<!DOCTYPE html>
	<html lang="en">
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at
	http://www.apache.org/licenses/LICENSE-2.0
	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	<head>
	<meta charset="utf-8" />
	<title>TailFile</title>

	<link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" />
	</head>

	<body>

	<h3>Introduction</h3>
	<p>
	This processor offers a powerful capability, allowing the user to periodically look at a file that is actively being written to by another process.
	When the file changes, the new lines are ingested. This Processor assumes that data in the file is textual.
	</p>
	<p>
	Tailing a file from a filesystem is a seemingly simple but notoriously difficult task. This is because we are periodically checking the contents
	of a file that is being written to. The file may be constantly changing, or it may rarely change. The file may be "rolled over" (i.e., renamed)
	and it's important that even after restarting the application (NiFi, in this case), we are able to pick up where we left off. Other additional complexities
	also come into play. For example, NFS mounted drives may indicate that data is readable but then return NUL bytes (Unicode 0) when attempting to read, as
	the actual bytes are not yet known (see the <Reread when NUL encountered> property), and file systems have different timestamp granularities.
	</p>
	<p>
	This Processor is designed to handle all of these different cases. This can lead to slightly more complex configuration, but this document should provide
	you with all you need to get started!
	</p>


	<h3>Modes</h3>
	<p>
	This processor is used to tail a file or multiple files, depending on the configured mode. The
	mode to choose depends of the logging pattern followed by the file(s) to tail. In any case, if there
	is a rolling pattern, the rolling files must be plain text files (compression is not supported at
	the moment).
	</p>

	<ul>
	<li><b>Single file</b>: the processor will tail the file with the path given in 'File(s) to tail' property.</li>
	<li><b>Multiple files</b>: the processor will look for files into the 'Base directory'. It will look for file recursively
	according to the 'Recursive lookup' property and will tail all the files matching the regular expression
	provided in the 'File(s) to tail' property.</li>
	</ul>

	<h3>Rolling filename pattern</h3>
	<p>
	In case the 'Rolling filename pattern' property is used, when the processor detects that the file to tail has rolled over, the
	processor will look for possible missing messages in the rolled file. To do so, the processor will use the pattern to find the
	rolling files in the same directory as the file to tail.
	</p>
	<p>
	In order to keep this property available in the 'Multiple files' mode when multiples files to tail are in the same directory,
	it is possible to use the ${filename} tag to reference the name (without extension) of the file to tail. For example, if we have:
	</p>
	<p>
	<code>
	/my/path/directory/my-app.log.1<br />
	/my/path/directory/my-app.log<br />
	/my/path/directory/application.log.1<br />
	/my/path/directory/application.log
	</code>
	</p>
	<p>
	the 'rolling filename pattern' would be <i>${filename}.log.*</i>.
	</p>
	<h3>Descriptions for different modes and strategies</h3>
	<p>
	The '<b>Single file</b>' mode assumes that the file to tail has always the same name even if there is a rolling pattern.
	Example:
	</p>
	<p>
	<code>
	/my/path/directory/my-app.log.2<br />
	/my/path/directory/my-app.log.1<br />
	/my/path/directory/my-app.log
	</code>
	</p>
	<p>
	and new log messages are always appended in my-app.log file.
	</p>
	<p>
	In case recursivity is set to 'true'. The regular expression for the files to tail must embrace the possible intermediate directories
	between the base directory and the files to tail. Example:
	</p>
	<p>
	<code>
	/my/path/directory1/my-app1.log<br />
	/my/path/directory2/my-app2.log<br />
	/my/path/directory3/my-app3.log
	</code>
	</p>
	<p>
	<code>
	Base directory: /my/path<br />
	Files to tail: directory[1-3]/my-app[1-3].log<br />
	Recursivity: true
	</code>
	</p>

	<p>
	If the processor is configured with '<b>Multiple files</b>' mode, two additional properties are relevant:
	</p>

	<ul>
	<li><b>Lookup frequency</b>: specifies the minimum duration the processor will wait before listing again the files to tail.</li>
	<li><b>Maximum age</b>: specifies the necessary minimum duration to consider that no new messages will be appended in a file
	regarding its last modification date. If the amount of time that has elapsed since the file was modified is larger than this
	period of time, the file will not be tailed. For example, if a file was modified 24 hours ago and this property is set to 12 hours,
	the file will not be tailed. But if this property is set to 36 hours, then the file will continue to be tailed.</li>
	</ul>

	<p>
	It is necessary to pay attention to 'Lookup frequency' and 'Maximum age' properties, as well as the frequency at which the processor is
	triggered, in order to achieve high performance. It is recommended to keep 'Maximum age' > 'Lookup frequency' > processor scheduling
	frequency to avoid missing data. It also recommended not to set 'Maximum Age' too low because if messages are appended in a file
	after this file has been considered "too old", all the messages in the file may be read again, leading to data duplication.
	</p>

	<p>
	If the processor is configured with '<b>Multiple files</b>' mode, the 'Rolling
	filename pattern' property must be specific enough to ensure that only the rolling files will be listed and not other currently tailed
	files in the same directory (this can be achieved using ${filename} tag).
	</p>


	<h3>Handling Multi-Line Messages</h3>
	<p>
	Most of the time, when we tail a file, we are happy to receive data periodically, however it was written to the file. There are scenarios, though,
	where we may have data written in such a way that multiple lines need to be retained together. Take, for example, the following lines of text that
	might be found in a log file:
	</p>
	<code>
	<pre>
	2021-07-09 14:12:19,731 INFO [main] org.apache.nifi.NiFi Launching NiFi...
	2021-07-09 14:12:19,915 INFO [main] o.a.n.p.AbstractBootstrapPropertiesLoader Determined default application properties path to be '/Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties'
	2021-07-09 14:12:19,919 INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 199 properties from /Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties
	2021-07-09 14:12:19,925 WARN Line 1 of Log Message
	Line 2: This is an important warning.
	Line 3: Please do not ignore this warning.
	Line 4: These lines of text make sense only in the context of the original message.
	2021-07-09 14:12:19,941 INFO [main] Final message in log file
	</pre>
	</code>

	<p>
	In this case, we may want to ensure that the log lines are not ingested in such a way that our multi-line log message is not broken up into Lines 1 and 2 in one FlowFile
	and Lines 3 and 4 in another. To accomplish this, the Processor exposes the <Line Start Pattern> property. If we set this Property to a value of
	<code>\d{4}-\d{2}-\d{2}</code>, then we are telling the Processor that each message should begin with 4 digits, followed by a dash, followed by 2 digits, a dash, and 2 digits.
	I.e., we are telling it that each message begins with a timestamp in yyyy-MM-dd format. Because of this, even if the Processor runs and sees only Lines 1 and 2 of our
	multiline log message, it will not ingest the data yet. It will wait until it sees the next message, which starts with a timestamp.
	</p>
	<p>
	Note that, because of this, the last message that the Processor will encounter in the above situation is the "Final message in log file" line. At this point, the Processor does
	not know whether the next line of text it encounters will be part of this line or a new message. As such, it will not ingest this data. It will wait until either another message
	is encountered (that matches our regex) or until the file is rolled over (renamed). Because of this, there may be some delay in ingesting the last message in the file, if the process
	that writes to the file just stops writing at this point.
	</p>

	<p>
	Additionally, we run the chance of the Regular Expression not matching the data in the file. This could result in buffering all of the file's content, which could cause NiFi
	to run out of memory. To avoid this, the <Max Buffer Size> property limits the amount of data that can be buffered. If this amount of data is buffered, it will be flushed
	to the FlowFile, even if another message hasn't been encountered.
	</p>

	</body>
	</html>