----- Editing answer to account for inconsistent spacing:
Not sure what the pythonic approach should be, but here's a method that could work.
Using OP's data sample as an example:
```
0 date=2015-09-17 time=21:05:35 duration=0
1 date=2015-09-17 time=21:05:36 duration=0
2 date=2015-09-17 time=21:05:37 duration=0
3 date=2015-09-17 time=21:05:38 duration=0
4 date=2015-09-17 time=21:05:39 duration=0
5 date=2015-09-17 time=21:05:40 duration=0
```
I loop through each line and split at the equals sign, then grab the desired text:
```
import pandas as pd

log_data = open('log_sample.txt', 'r')
split_list = []
for line in log_data:
    thing1 = line.split('=')
    # print(thing1)
    date = thing1[1][:10]
    time = thing1[2][:8]
    dur = thing1[3].strip()  # strip the trailing newline
    split_list.append([date, time, dur])

df = pd.DataFrame(split_list, columns=['date', 'time', 'duration'])
df
```
----- First Answer:
As @jezrael mentions in the comments, you can leverage the "sep" argument within read_csv.
```
pd.read_csv('test.txt', sep=r'\t', engine='python')  # [1]
```
See:
- [1] python pandas read_csv not recognizing \t in tab delimited file
- //pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Guide to easier data handling in Python
Many times we have to import log file data from machines, and this data has to be heavily prepared before it is digestible (a process also called data munging). This story provides you with a detailed explanation (including data and code) of readline, loops, split, pop, and regular expressions. Hopefully this will help you process your data more quickly in the future.
Background:
Our robot’s machine log file looks like this:
It is not important for us what the log file means in detail. But we need to understand the basic concept so that we can more easily adapt the code to our needs:
As we can see in the screenshot above, the robot receives Production Order (PO) 1212201 and moves from position 10 to position 1. The robot then does whatever it has to do and finally returns to position 1. This is one complete, successful step. If the robot did not return to position 1, that would mean there was a failure in between. For our training example we only want to calculate the throughput times for those datasets that belong to a successful process step.
Solution:
In the interest of efficient and convenient analysis, we will first split this text file into a nice table format. Let’s see how we can achieve this.
Walk the line using Readline
We import pandas for the later steps and open the log file with Python's built-in open function. Then, in a loop, we readline each line from our log file and append it to a list. In case of an error we output the line where the error occurred (try and except).
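A minimal sketch of that loop; the file name machine_log.txt and the exact error handling are assumptions, not the original notebook:

```
import pandas as pd

lines = []
with open('machine_log.txt', 'r') as log_file:  # assumed file name
    while True:
        try:
            line = log_file.readline()
            if not line:               # an empty string means end of file
                break
            lines.append(line.strip())
        except Exception as err:       # report the problem line and keep going
            print('Could not read line', len(lines) + 1, ':', err)
```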
The list now looks like this:
Split list into dataframe
We convert this " : "-separated list into a DataFrame using the DataFrame constructor and the str.split function. We will call these column headers DateAndTime and Description:
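Assuming each entry in the list has the shape "date and time : description", this step could look roughly like this (the column names come from the story, everything else is an assumption):

```
# One-column DataFrame from the list of raw lines, then split once on ' : '
df = pd.DataFrame(lines, columns=['raw'])
df[['DateAndTime', 'Description']] = df['raw'].str.split(' : ', n=1, expand=True)
df = df.drop(columns='raw')
```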
To split the first column DateAndTime into two new columns Date and Time, we first string-split this column using a space (" ") as the separator. Setting expand to True results in a DataFrame (otherwise the output would have been a pandas Series object). n defines the maximum number of splits to make in a single string, which we set to 1 (the default is -1, which means all).
Now we can simply add the Date (position 0 of the split result, here called DT) into a new column, and similarly the Time from position 1.
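A sketch of the split described above, with DT as the intermediate result of the expand=True split:

```
# Split DateAndTime once at the first space: position 0 is the date,
# position 1 is the time. expand=True gives a DataFrame instead of a Series.
DT = df['DateAndTime'].str.split(' ', n=1, expand=True)
df['Date'] = DT[0]
df['Time'] = DT[1]
```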
Extracting strings
Pandas' str.extract function searches for everything between < and >. This extract call may look a little puzzling, so we will look at another option later on, which you might find more intuitive.
Then we split off everything in the Description column that comes after the >:
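A sketch of both steps; the example Description text "<1212201> from 10 to 1" and the column names PO and Rest are assumptions for illustration:

```
# Everything between < and >, e.g. the production order number
df['PO'] = df['Description'].str.extract(r'\<(.*)\>', expand=False)
# Everything that comes after the closing >
df['Rest'] = df['Description'].str.split('>', n=1, expand=True)[1]
```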
For extracting Year, Month, Day etc. out of Datestamp, Pandas’ to_datetime function works like a charm:
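For example (assuming the DateAndTime strings parse cleanly):

```
# Convert to a real datetime first, then pull out the parts we need
df['Datestamp'] = pd.to_datetime(df['DateAndTime'])
df['Year'] = df['Datestamp'].dt.year
df['Month'] = df['Datestamp'].dt.month
df['Day'] = df['Datestamp'].dt.day
```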
Regular expression operations
The re module provides regular expression matching operations similar to those found in Perl and is very powerful for string operations (see Python.org for more details). Remember that we used str.extract('\<(.*)\>') in the code above to get the string between < and >? With re we can take the next step and search and split very specifically.
Firstly, we will set up empty buckets (empty lists to collect the results):
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. We are using findall, split, and pop for this:
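A rough sketch of these operations; the Description format "<1212201> from 10 to 1" and the new column names are assumptions, not the original code:

```
import re

# Empty buckets for the pieces we want to collect
po_list, from_list, to_list = [], [], []

# Compiled pattern: the production order number between < and >
po_pattern = re.compile(r'<(\d+)>')

for desc in df['Description']:
    po_list.append(po_pattern.findall(desc)[0])         # findall returns a list of matches
    tokens = desc.split()                               # split the text on whitespace
    to_list.append(tokens.pop())                        # pop takes the last token: the "to" position
    from_list.append(tokens[tokens.index('from') + 1])  # the token right after 'from'

df['From'] = from_list
df['To'] = to_list
```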
Here comes the interesting part: we want to calculate the throughput time only for each consecutive completed process. In tabular words: when “to” position returns back to “from” position and vice versa:
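One way to express that pairing, building on the assumed From/To columns and Datestamp from the sketches above (not the original notebook code):

```
# A step counts as completed when the next row swaps From and To back,
# e.g. 10 -> 1 directly followed by 1 -> 10. For those pairs the throughput
# time is the time difference between the two rows.
completed = (df['From'] == df['To'].shift(-1)) & (df['To'] == df['From'].shift(-1))
lead_time = df['Datestamp'].shift(-1) - df['Datestamp']
df['LeadTime'] = lead_time.where(completed)  # NaT where the step was not completed
```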
This finally returns what we are striving for.
```
df.head()
```
Summary:
Congratulations, we have successfully imported a log file and prepared it for lead time calculation with some special adjustments. Hopefully the explanations make it easy for you to customize it to your needs. You can find the Jupyter Notebook and log file in my GitHub repository. Many thanks for reading, I hope this was helpful! Feel free to connect with me on LinkedIn, Twitter or Workrooms.