Reading data and plotting
Reading data from files is by far the most common way to use pandas.
To support this, pint-pandas provides DataFrame accessors that make it easy to get to PintArray objects.
Read data from csv
First, some imports:
In [1]: import pandas as pd
In [2]: import pint
In [3]: import pint_pandas
In [4]: import io
Here are the contents of the CSV file:
In [5]: test_data = """ShaftSpeedIndex,rpm,1200,1200,1200,1600,1600,1600,2300,2300,2300
...: pump,,A,B,C,A,B,C,A,B,C
...: TestDate,No Unit,01/01,01/01,01/01,01/01,01/01,01/01,01/02,01/02,01/02
...: ShaftSpeed,rpm,1200,1200,1200,1600,1600,1600,2300,2300,2300
...: FlowRate,m^3 h^-1,8.72,9.28,9.31,11.61,12.78,13.51,18.32,17.90,19.23
...: DifferentialPressure,kPa,162.03,144.16,136.47,286.86,241.41,204.21,533.17,526.74,440.76
...: ShaftPower,kW,1.32,1.23,1.18,3.09,2.78,2.50,8.59,8.51,7.61
...: Efficiency,dimensionless,30.60,31.16,30.70,30.72,31.83,31.81,32.52,31.67,32.05"""
...:
Let’s read that into a DataFrame. Here io.StringIO stands in for a file on disk; normally a path to a CSV file would be passed instead, as shown in the commented-out line.
In [6]: df = pd.read_csv(io.StringIO(test_data), header=[0, 1], index_col=[0, 1]).T
# df = pd.read_csv("/path/to/test_data.csv", header=[0, 1], index_col=[0, 1]).T
In [7]: for col in df.columns:
...:     try:
...:         df[col] = pd.to_numeric(df[col])
...:     except ValueError:
...:         pass
...:
In [8]: df.dtypes
Out[8]:
TestDate No Unit str
ShaftSpeed rpm int64
FlowRate m^3 h^-1 float64
DifferentialPressure kPa float64
ShaftPower kW float64
Efficiency dimensionless float64
dtype: object
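The file stores one variable per row, which is why the call above reads the first two columns as the index and then transposes. As a rough illustration of the resulting layout, the standard-library sketch below rebuilds the same structure from a shortened copy of the data. It is not what pandas does internally, and `sample`, `row_index` and `columns` are names chosen here for illustration.

```python
import csv
import io

# Shortened copy of the test data: one variable per row,
# with the name in the first field and the unit in the second.
sample = """ShaftSpeedIndex,rpm,1200,1200,1600
pump,,A,B,A
TestDate,No Unit,01/01,01/01,01/01
FlowRate,m^3 h^-1,8.72,9.28,11.61
"""

rows = list(csv.reader(io.StringIO(sample)))

# The first two rows become the row index after transposing.
row_index = list(zip(rows[0][2:], rows[1][2:]))

# Every remaining row becomes a column keyed by (name, unit).
columns = {(row[0], row[1]): row[2:] for row in rows[2:]}

print(row_index)
print(list(columns))
```

The values are still strings at this point, which is the reason for the pd.to_numeric loop in the session above.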
The pint dtype can also be specified directly using the dtype argument.
When values are strings, they are passed to pint.Quantity(), so any format pint can parse is accepted,
including values with or without a space between magnitude and unit, values in a different but compatible unit
(automatically converted), and missing values.
In [9]: simple_data = """mass,distance
...: 1 kg,1 m
...: 1 lb,1 mile
...: 1kg,1mile
...: ,
...: """
...:
In [10]: pd.read_csv(io.StringIO(simple_data), dtype={"mass": "pint[kg]", "distance": "pint[m]"})
Out[10]:
mass distance
0 1.0 1.0
1 0.453592 1609.344
2 1.0 1609.344
3 NaN NaN
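The values in Out[10] can be checked by hand. The sketch below is not pint's parser; it is a minimal standard-library illustration that splits each string into magnitude and unit and applies the exact definitions 1 lb = 0.45359237 kg and 1 mile = 1609.344 m. The helper `to_base` and the lookup table `FACTORS` are hypothetical names.

```python
import math
import re

# Exact conversion factors to the target units (kg and m).
FACTORS = {"kg": 1.0, "lb": 0.45359237, "m": 1.0, "mile": 1609.344}

# Magnitude, optional whitespace, then the unit name.
PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def to_base(text):
    """Return the magnitude in kg or m, or NaN for a missing value."""
    match = PATTERN.match(text)
    if match is None:
        return math.nan
    magnitude, unit = match.groups()
    return float(magnitude) * FACTORS[unit]


for cell in ["1 kg", "1 lb", "1kg", "1 mile", "1mile", ""]:
    print(repr(cell), "->", to_base(cell))
```

Rounded to six decimals, 0.45359237 becomes the 0.453592 shown in the mass column.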
Pandas DataFrame Accessors
Next, use the quantify method of the DataFrame’s pint accessor to convert the columns from ndarray to PintArray, taking the units from the bottom column level.
A column whose unit is ‘No Unit’ is not converted to a PintArray by quantify. This marker string can be changed by setting pint_pandas.pint_array.NO_UNIT.
In [11]: df_ = df.pint.quantify(level=-1)
In [12]: df_
Out[12]:
TestDate ShaftSpeed ... ShaftPower Efficiency
ShaftSpeedIndex pump ...
1200 A 01/01 1200 ... 1.32 30.6
B 01/01 1200 ... 1.23 31.16
C 01/01 1200 ... 1.18 30.7
1600 A 01/01 1600 ... 3.09 30.72
B 01/01 1600 ... 2.78 31.83
C 01/01 1600 ... 2.5 31.81
2300 A 01/02 2300 ... 8.59 32.52
B 01/02 2300 ... 8.51 31.67
C 01/02 2300 ... 7.61 32.05
[9 rows x 6 columns]
Let’s confirm the units have been parsed correctly by looking at the dtypes.
In [13]: df_.dtypes
Out[13]:
TestDate str
ShaftSpeed pint[revolutions_per_minute][int64]
FlowRate pint[meter ** 3 / hour][float64]
DifferentialPressure pint[kilopascal][float64]
ShaftPower pint[kilowatt][float64]
Efficiency pint[dimensionless][float64]
dtype: object
Here the Efficiency has been parsed as dimensionless. Let’s change it to percent.
In [14]: df_["Efficiency"] = pint_pandas.PintArray(
....:     df_["Efficiency"].values.quantity.m, dtype="pint[percent]"
....: )
....:
In [15]: df_.dtypes
Out[15]:
TestDate str
ShaftSpeed pint[revolutions_per_minute][int64]
FlowRate pint[meter ** 3 / hour][float64]
DifferentialPressure pint[kilopascal][float64]
ShaftPower pint[kilowatt][float64]
Efficiency pint[percent][Float64]
dtype: object
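Note that the cell above takes the stored magnitudes and relabels them as percent rather than converting them, which suits this data if the file holds percentages under a dimensionless label. A small arithmetic sketch of the difference, using the first efficiency value (the variable names are illustrative):

```python
# First efficiency value from the file, labelled "dimensionless".
raw = 30.60

# Relabelling keeps the magnitude and only swaps the unit: 30.6 percent.
relabelled_percent = raw

# A true unit conversion would scale by 100, since 1 (dimensionless) = 100 percent.
converted_percent = raw * 100

# Expressed as a plain ratio, the relabelled value is raw / 100.
as_ratio = relabelled_percent / 100

print(relabelled_percent, converted_percent, as_ratio)
```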
As before, operations between DataFrame columns are unit-aware:
In [16]: df_.ShaftPower / df_.ShaftSpeed
Out[16]:
ShaftSpeedIndex pump
1200 A 0.0011
B 0.001025
C 0.000983
1600 A 0.001931
B 0.001737
C 0.001563
2300 A 0.003735
B 0.0037
C 0.003309
dtype: pint[kilowatt / revolutions_per_minute][float64]
In [17]: df_["ShaftTorque"] = df_.ShaftPower / df_.ShaftSpeed
In [18]: df_["FluidPower"] = df_["FlowRate"] * df_["DifferentialPressure"]
In [19]: df_
Out[19]:
TestDate ShaftSpeed ... ShaftTorque FluidPower
ShaftSpeedIndex pump ...
1200 A 01/01 1200 ... 0.0011 1412.9016
B 01/01 1200 ... 0.001025 1337.8048
C 01/01 1200 ... 0.000983 1270.5357
1600 A 01/01 1600 ... 0.001931 3330.4446
B 01/01 1600 ... 0.001737 3085.2198
C 01/01 1600 ... 0.001563 2758.8771
2300 A 01/02 2300 ... 0.003735 9767.6744
B 01/02 2300 ... 0.0037 9428.646
C 01/02 2300 ... 0.003309 8475.8148
[9 rows x 8 columns]
In [20]: df_.groupby(by=["ShaftSpeedIndex"])[['FlowRate', 'DifferentialPressure', 'ShaftPower', 'Efficiency']].mean()
Out[20]:
FlowRate DifferentialPressure ShaftPower Efficiency
ShaftSpeedIndex
1200 9.103333 147.553333 1.243333 30.82
1600 12.633333 244.16 2.79 31.453333
2300 18.483333 500.223333 8.236667 32.08
The DataFrame’s pint.dequantify method then allows us to retrieve the units information as a header row once again.
In [21]: df_.pint.dequantify()
Out[21]:
TestDate ... FluidPower
unit No Unit ... kilopascal * meter ** 3 / hour
ShaftSpeedIndex pump ...
1200 A 01/01 ... 1412.9016
B 01/01 ... 1337.8048
C 01/01 ... 1270.5357
1600 A 01/01 ... 3330.4446
B 01/01 ... 3085.2198
C 01/01 ... 2758.8771
2300 A 01/02 ... 9767.6744
B 01/02 ... 9428.6460
C 01/02 ... 8475.8148
[9 rows x 8 columns]
This makes some powerful operations straightforward. For example, to change a column’s units:
In [22]: df_["FluidPower"] = df_["FluidPower"].pint.to("kW")
In [23]: df_["FlowRate"] = df_["FlowRate"].pint.to("L/s")
In [24]: df_["ShaftTorque"] = df_["ShaftTorque"].pint.to("N m")
In [25]: df_.pint.dequantify()
Out[25]:
TestDate ShaftSpeed ... ShaftTorque FluidPower
unit No Unit revolutions_per_minute ... meter * newton kilowatt
ShaftSpeedIndex pump ...
1200 A 01/01 1200 ... 10.504226 0.392473
B 01/01 1200 ... 9.788029 0.371612
C 01/01 1200 ... 9.390142 0.352927
1600 A 01/01 1600 ... 18.442079 0.925123
B 01/01 1600 ... 16.591903 0.857005
C 01/01 1600 ... 14.920776 0.766355
2300 A 01/02 2300 ... 35.664547 2.713243
B 01/02 2300 ... 35.332397 2.619068
C 01/02 2300 ... 31.595716 2.354393
[9 rows x 8 columns]
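The first row of Out[25] can be reproduced by hand, which makes explicit what the conversions do. This is plain arithmetic with illustrative variable names, using 1 rpm = 2π/60 rad/s and 1 h = 3600 s:

```python
import math

# First row of the table: pump A at 1200 rpm.
shaft_power_kw = 1.32
shaft_speed_rpm = 1200
flow_rate_m3_per_h = 8.72
differential_pressure_kpa = 162.03

# Torque = power / angular speed, with rpm converted to rad/s.
angular_speed_rad_per_s = shaft_speed_rpm * 2 * math.pi / 60
shaft_torque_nm = shaft_power_kw * 1000 / angular_speed_rad_per_s

# Fluid power = flow rate * pressure: kPa * m^3/h -> Pa * m^3/s = W.
fluid_power_w = differential_pressure_kpa * 1000 * flow_rate_m3_per_h / 3600
fluid_power_kw = fluid_power_w / 1000

print(round(shaft_torque_nm, 6))  # 10.504226
print(round(fluid_power_kw, 6))   # 0.392473
```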
The units are harder to read than they need to be, so let’s change pint’s default format for displaying units.
In [26]: pint_pandas.PintType.ureg.formatter.default_format = "P~"
In [27]: df_.pint.dequantify()
Out[27]:
TestDate ShaftSpeed ... ShaftTorque FluidPower
unit No Unit rpm ... m·N kW
ShaftSpeedIndex pump ...
1200 A 01/01 1200 ... 10.504226 0.392473
B 01/01 1200 ... 9.788029 0.371612
C 01/01 1200 ... 9.390142 0.352927
1600 A 01/01 1600 ... 18.442079 0.925123
B 01/01 1600 ... 16.591903 0.857005
C 01/01 1600 ... 14.920776 0.766355
2300 A 01/02 2300 ... 35.664547 2.713243
B 01/02 2300 ... 35.332397 2.619068
C 01/02 2300 ... 31.595716 2.354393
[9 rows x 8 columns]
The entire table’s units can be converted as well:
In [28]: df_.pint.to_base_units().pint.dequantify()
Out[28]:
TestDate ShaftSpeed ... ShaftTorque FluidPower
unit No Unit rad/s ... kg·m²/s² kg·m²/s³
ShaftSpeedIndex pump ...
1200 A 01/01 125 ... 10.504226 392.472667
B 01/01 125 ... 9.788029 371.612444
C 01/01 125 ... 9.390142 352.926583
1600 A 01/01 167 ... 18.442079 925.123500
B 01/01 167 ... 16.591903 857.005500
C 01/01 167 ... 14.920776 766.354750
2300 A 01/02 240 ... 35.664547 2713.242889
B 01/02 240 ... 35.332397 2619.068333
C 01/02 240 ... 31.595716 2354.393000
[9 rows x 8 columns]
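One detail in Out[28] is worth a second look: the ShaftSpeed values are whole numbers. 1200 rpm is about 125.66 rad/s, so the displayed 125 is consistent with the result being truncated, presumably because the column holds int64 magnitudes. The arithmetic below only demonstrates that reading; whether truncation is what happens internally is an assumption here.

```python
import math

speeds_rpm = (1200, 1600, 2300)

# Exact conversion: 1 rpm = 2*pi/60 rad/s.
exact = [rpm * 2 * math.pi / 60 for rpm in speeds_rpm]

# int() truncates toward zero, which matches the values shown in the table.
truncated = [int(value) for value in exact]

for rpm, value, whole in zip(speeds_rpm, exact, truncated):
    print(rpm, round(value, 2), whole)
```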
Plotting
Pint’s matplotlib support allows columns with the same dimensionality to be plotted. First, set up matplotlib to use pint’s units.
In [29]: import matplotlib.pyplot as plt
In [30]: pint_pandas.PintType.ureg.setup_matplotlib()
Let’s convert a column to a different unit and plot two columns with different units. Pint’s matplotlib support automatically converts the values to the units of the first column plotted and adds those units to the axis label.
In [31]: df_['FluidPower'] = df_['FluidPower'].pint.to('W')
In [32]: df_[["ShaftPower", "FluidPower"]].dtypes
Out[32]:
ShaftPower pint[kW][float64]
FluidPower pint[W][float64]
dtype: object
In [33]: fig, ax = plt.subplots()
In [34]: ax = df_[["ShaftPower", "FluidPower"]].unstack("pump").plot(ax=ax)
In [35]: ax.yaxis.units
Out[35]: <Unit('kilowatt')>
In [36]: ax.yaxis.label
Out[36]: Text(55.847222222222214, 0.5, 'kilowatt')
Single row headers
A parsing function can be passed into df.pint.quantify to handle single row headers.
In [37]: df = pd.DataFrame(
....:     {
....:         "no_unit_column": pd.Series([i for i in range(4)], dtype="Float64"),
....:         "torque [lbf ft]": pd.Series([1.0, 2.0, 2.0, 3.0], dtype="Float64"),
....:     }
....: )
....:
In [38]: def parsing_function(column_name):
....:     if "[" in column_name:
....:         return column_name.split("]")[0].split(" [")
....:     return column_name, pint_pandas.pint_array.NO_UNIT
....:
In [39]: df.pint.quantify(parsing_function=parsing_function)
Out[39]:
no_unit_column torque
0 0.0 1.0
1 1.0 2.0
2 2.0 2.0
3 3.0 3.0
Alternatively df.pint.quantify() will attempt to parse single row headers that adhere to the following formats:
{column_name} [{unit}]
{column_name} ({unit})
{column_name} / {unit}
In [40]: df = pd.DataFrame(
....:     {
....:         "no_unit_column": pd.Series([i for i in range(4)], dtype="Float64"),
....:         "torque [lbf ft]": pd.Series([1.0, 2.0, 2.0, 3.0], dtype="Float64"),
....:     }
....: )
....:
In [41]: df_ = df.pint.quantify()
In [42]: df_
Out[42]:
no_unit_column torque
0 0.0 1.0
1 1.0 2.0
2 2.0 2.0
3 3.0 3.0
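To make the three supported header formats concrete, the following standard-library sketch splits such headers into a name and a unit. It is not the pint-pandas implementation; `split_header`, the regular expressions and the local `NO_UNIT` constant (mirroring the ‘No Unit’ marker mentioned earlier) are assumptions made for the example.

```python
import re

NO_UNIT = "No Unit"  # local stand-in for pint_pandas.pint_array.NO_UNIT

# One pattern per format: "name [unit]", "name (unit)", "name / unit".
PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s*\[(?P<unit>.+)\]$"),
    re.compile(r"^(?P<name>.+?)\s*\((?P<unit>.+)\)$"),
    re.compile(r"^(?P<name>.+?)\s*/\s*(?P<unit>.+)$"),
]


def split_header(header):
    """Return (column_name, unit), or (header, NO_UNIT) if no format matches."""
    for pattern in PATTERNS:
        match = pattern.match(header.strip())
        if match:
            return match.group("name"), match.group("unit")
    return header, NO_UNIT


for header in ["torque [lbf ft]", "torque (lbf ft)", "torque / lbf ft", "no_unit_column"]:
    print(split_header(header))
```

The bracket and parenthesis patterns are tried first so that a unit containing a slash, such as "speed [m/s]", is not split at the slash.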
df.pint.dequantify() performs the reverse operation; its writing_function argument controls how the column name and unit are combined into a header.
In [43]: df_.pint.dequantify()
Out[43]:
no_unit_column torque [ft·lbf]
0 0.0 1.0
1 1.0 2.0
2 2.0 2.0
3 3.0 3.0
In [44]: def writing_function(column_name, unit):
....:     if unit == pint_pandas.pint_array.NO_UNIT:
....:         return column_name
....:     return f"{column_name} [{unit}]"
....:
In [45]: df_.pint.dequantify(writing_function=writing_function)
Out[45]:
no_unit_column torque [ft·lbf]
0 0.0 1.0
1 1.0 2.0
2 2.0 2.0
3 3.0 3.0
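Because the parsing and writing functions are plain Python callables, they can be exercised without a DataFrame. The round-trip check below is a sketch under two assumptions: `NO_UNIT` is defined locally as the string "No Unit" so the snippet has no dependencies, and the parsing function is a tuple-returning variant of the one shown earlier.

```python
NO_UNIT = "No Unit"  # local stand-in for pint_pandas.pint_array.NO_UNIT


def parsing_function(column_name):
    """Split 'name [unit]' into (name, unit)."""
    if "[" in column_name:
        name, unit = column_name.split("]")[0].split(" [")
        return name, unit
    return column_name, NO_UNIT


def writing_function(column_name, unit):
    """Join name and unit back into 'name [unit]'."""
    if unit == NO_UNIT:
        return column_name
    return f"{column_name} [{unit}]"


for header in ["no_unit_column", "torque [lbf ft]"]:
    name, unit = parsing_function(header)
    # Writing the parsed parts back should reproduce the original header.
    print(header, "->", (name, unit), "->", writing_function(name, unit))
```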