Add a string prefix to each value in a pandas string column
- 2025-02-14 09:49:00
- admin (original)
Problem description:
I would like to prepend a string to the start of each value in a given column of a pandas DataFrame. I am currently using:
df.ix[(df['col'] != False), 'col'] = 'str' + df[(df['col'] != False), 'col']
This seems an inelegant method. Do you know any other way (which maybe also adds the character to rows where that column is 0 or NaN)?
As an example, I would like to turn:
col
1 a
2 0
into:
col
1 stra
2 str0
Solution 1:
df['col'] = 'str' + df['col'].astype(str)
Example:
>>> df = pd.DataFrame({'col':['a',0]})
>>> df
col
0 a
1 0
>>> df['col'] = 'str' + df['col'].astype(str)
>>> df
col
0 stra
1 str0
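The question also asks about NaN rows. Depending on the pandas version, astype(str) may turn NaN into the literal text 'nan', so the following is a minimal sketch that fills missing values with an empty string before casting. The three-row sample frame is an assumption made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a string, a zero, and a missing value.
df = pd.DataFrame({'col': ['a', 0, np.nan]})

# Replace NaN with an empty string first, then cast to str and prepend the prefix.
df['col'] = 'str' + df['col'].fillna('').astype(str)

print(df['col'].tolist())
```

Filling with a different placeholder (for example 'NaN') instead of '' changes only the text that appears after the prefix.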
Solution 2:
As an alternative, you can also use apply combined with format (or, better, with f-strings), which I find slightly more readable if you also want to add a suffix or manipulate the element itself:
df = pd.DataFrame({'col':['a', 0]})
df['col'] = df['col'].apply(lambda x: "{}{}".format('str', x))
which also yields the desired output:
col
0 stra
1 str0
If you are using Python 3.6+, you can also use f-strings:
df['col'] = df['col'].apply(lambda x: f"str{x}")
yielding the same output.
The f-string version is almost as fast as @RomanPekar's solution (Solution 1 above), measured on Python 3.6.4:
df = pd.DataFrame({'col':['a', 0]*200000})
%timeit df['col'].apply(lambda x: f"str{x}")
117 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit 'str' + df['col'].astype(str)
112 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using format, however, is noticeably slower:
%timeit df['col'].apply(lambda x: "{}{}".format('str', x))
185 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3:
You can use pandas.Series.map:
df['col'] = df['col'].map('str{}'.format)
In this example, it prepends the string str to every value.
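If the column contains missing values, map formats them like any other value. As a hedged sketch, the na_action='ignore' option of Series.map leaves NaN rows untouched instead; the sample frame with a NaN row is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({'col': ['a', 0, np.nan]})

# map returns a new Series, so assign it back to keep the result.
# na_action='ignore' passes NaN through instead of handing it to format.
df['col'] = df['col'].map('str{}'.format, na_action='ignore')

print(df)
```

The first two rows receive the prefix and the third stays missing, so it can still be handled later with fillna().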
Solution 4:
If you load your table file with dtype=str, or convert the column to strings with df['a'] = df['a'].astype(str), then you can use this approach:
df['a']= 'col' + df['a'].str[:]
This approach lets you prepend to, append to, and take substrings of the strings in df.
Works on Pandas v0.23.4, v0.24.1. Don't know about earlier versions.
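The three operations mentioned above (prepend, append, substring) could look like the following sketch. The sample values and the '_end' suffix are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame; column 'a' is cast to strings first, as the answer requires.
df = pd.DataFrame({'a': ['alpha', 0]})
df['a'] = df['a'].astype(str)

prefixed = 'col' + df['a'].str[:]    # prepend a prefix
suffixed = df['a'].str[:] + '_end'   # append a suffix
first_two = df['a'].str[:2]          # keep only the first two characters

print(prefixed.tolist(), suffixed.tolist(), first_two.tolist())
```

Once the column holds strings, .str[:] returns the values unchanged, so the slice matters only when you actually want a substring.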
Solution 5:
Another solution with .loc:
df = pd.DataFrame({'col': ['a', 0]})
df.loc[df.index, 'col'] = 'string' + df['col'].astype(str)
This is not as quick as the solutions above (more than 1 ms per loop slower), but it may be useful if you need a conditional change, for example:
mask = (df['col'] == 0)
df.loc[mask, 'col'] = 'string' + df['col'].astype(str)
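Put together as a self-contained sketch, the conditional variant changes only the rows selected by the mask. Restricting the right-hand side to the same mask is a small variation on the answer's code, not a requirement: pandas aligns on the index either way.

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 0]})

# Select only the rows whose value equals 0.
mask = df['col'] == 0

# Prefix just those rows; rows outside the mask are left as they were.
df.loc[mask, 'col'] = 'string' + df.loc[mask, 'col'].astype(str)

print(df['col'].tolist())
```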
Solution 6:
This variant prefixes a column while controlling how NaNs are rendered, which is useful for things like human-readable values in a CSV export.
"_" + df['col1'].replace(np.nan,'').astype(str)
Example:
import sys
import platform
import pandas as pd
import numpy as np
print("python {}".format(platform.python_version()))
print("pandas {}".format(pd.__version__))
print("numpy {}".format(np.__version__))
df = pd.DataFrame({
'col1':["1a","1b","1c",np.nan],
'col2':["2a","2b",np.nan,"2d"],
'col3':[31,32,33,34],
'col4':[np.nan,42,43,np.nan]})
df['col1_prefixed'] = "_" + df['col1'].replace(np.nan,'no value').astype(str)
df['col4_prefixed'] = "_" + df['col4'].replace(np.nan,'no value').astype(str)
print(df)
python 3.7.3
pandas 1.2.3
numpy 1.18.5
  col1 col2  col3  col4 col1_prefixed col4_prefixed
0   1a   2a    31   NaN           _1a     _no value
1   1b   2b    32  42.0           _1b         _42.0
2   1c  NaN    33  43.0           _1c         _43.0
3  NaN   2d    34   NaN     _no value     _no value
(Sorry for the verbosity, I found this Q while working on an unrelated column type issue and this is my reproduction code)
Solution 7:
You can use radd() to add a string element-wise to each value in a column (N.B. make sure to convert the column into a string column using astype() if the column contains mixed types). An example:
df = pd.DataFrame({'col': ['a', 0]})
df['col'] = df['col'].astype('string').radd('str')
which outputs
col
0 stra
1 str0
It has two advantages over concatenation via +:
Null handling: If the column contains NaN values, + simply returns NaN for those rows. For example:
df = pd.DataFrame({'col': ['a', float('nan')]})
df['col'] = 'str' + df['col']
which outputs
col
0 stra
1 NaN
which forces you to handle the NaN later using fillna() or similar. With radd(), however, you can pass the fill_value= kwarg to handle the NaN values in the same function call. For the example above (starting again from the original df), we can pass fill_value='' to treat NaN values as an empty string, so that adding the prefix yields a column of strings:
df['col'] = df['col'].radd('str', fill_value='')
which outputs
col
0 stra
1 str
As a side note, there's a difference between using astype(str) and astype('string'). One important difference is null handling: in pandas 1.x and 2.x, astype(str) turns NaN into the literal string 'nan', whereas astype('string') keeps it as a missing value.
Method chaining: If you add prefixes to a string column as one step of a pipeline, it can be important to do so with method chaining. + forces you to break out of the pipeline, whereas radd() makes it clear that the prefix is prepended after a chain of methods. For example, we can do the following:
df = pd.DataFrame({'col': ['a', 0]})
df.reset_index().astype({'col': 'string'}).radd({'index': 0, 'col': 'str'})
which outputs
index col
0 0 stra
1 1 str0
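To make the earlier side note on null handling concrete, here is a small sketch using the 'string' dtype. The variable names are hypothetical; it shows that a missing value stays missing after radd() unless fill_value is given.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan])

# The nullable 'string' dtype keeps the missing value missing.
as_string = s.astype('string')

kept_missing = as_string.radd('str')             # missing row stays missing
filled = as_string.radd('str', fill_value='')    # missing row becomes 'str'

print(kept_missing.isna().tolist())
print(filled.tolist())
```

Which behavior you want depends on whether downstream code should still be able to detect the missing rows.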