This is the fifth part of my Pandas & Python Tricks series.
A few days ago I shared some Python and Pandas tricks to help data analysts and data scientists quickly learn valuable new concepts they may not be aware of. This is also part of the collection of tips that I share daily on LinkedIn.
Combine SQL statements and Pandas
My intuition tells me that over 80% of Data Scientists use Pandas in their daily Data Science activities.
And I believe that's because it is part of the wider Python ecosystem, which makes it accessible to many people.
𝙒𝙝𝙖𝙩 𝙖𝙗𝙤𝙪𝙩 𝙎𝙌𝙇?
Even if not everyone uses it on a daily basis (perhaps because not every company has an SQL database), SQL’s performance is undeniable. Moreover, it is human-readable, which makes it easy to understand even for non-technical people.
❓What if we could find a way to 𝙘𝙤𝙢𝙗𝙞𝙣𝙚 the benefits of 𝙗𝙤𝙩𝙝 Pandas and 𝙎𝙌𝙇?
✅ This is where 𝗽𝗮𝗻𝗱𝗮𝘀𝗾𝗹 becomes useful 🎉🎉🎉
Below is an illustration 💡 You can also watch the full video here.
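As a minimal sketch of the idea, the snippet below runs a plain SQL query against a Pandas DataFrame through the `sqldf` function of the third-party `pandasql` package. The `students` table, its columns and the `run_query` helper are made up for illustration, and `pandasql` has to be installed separately (depending on your installed versions, it may need an older SQLAlchemy release).

```python
import pandas as pd

# Hypothetical data, only here to have something to query
students = pd.DataFrame({
    "name": ["Adja", "Fanta", "Moussa", "Awa"],
    "field": ["Statistics", "Statistics", "Computer Science", "Computer Science"],
    "grade": [15.0, 17.0, 14.0, 16.0],
})

# Plain SQL: the table name must match a key in the mapping passed to run_query
query = """
SELECT field, AVG(grade) AS avg_grade
FROM students
GROUP BY field
"""

def run_query(sql, tables):
    """Run a SQL string against DataFrames; `tables` maps table names to DataFrames."""
    # Imported lazily so the rest of the sketch loads even without pandasql
    from pandasql import sqldf  # third-party: pip install pandasql
    return sqldf(sql, tables)
```

Calling `run_query(query, {"students": students})` should give back a regular DataFrame with one row per field, so you can keep working on it with the usual Pandas methods.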
Update data from a given dataframe with another dataframe
There are several ways to replace missing values 🧩 in Pandas, from simple imputation to more advanced methods.
But… 🚨
Sometimes you just want to replace them using non-NA values from another DataFrame.
✅ This can be achieved by using Pandas built-in update function.
It aligns the two DataFrames on their index and columns before performing the update.
General syntax ⚙️ below:
first_dataframe.update(second_dataframe, overwrite=True)
✨ missing values of 𝗳𝗶𝗿𝘀𝘁_𝗱𝗮𝘁𝗮𝗳𝗿𝗮𝗺𝗲 are replaced by non-missing values of 𝘀𝗲𝗰𝗼𝗻𝗱_𝗱𝗮𝘁𝗮𝗳𝗿𝗮𝗺𝗲
✨ 𝗼𝘃𝗲𝗿𝘄𝗿𝗶𝘁𝗲=𝗧𝗿𝘂𝗲 is the default: every value of 𝗳𝗶𝗿𝘀𝘁_𝗱𝗮𝘁𝗮𝗳𝗿𝗮𝗺𝗲 is replaced wherever 𝘀𝗲𝗰𝗼𝗻𝗱_𝗱𝗮𝘁𝗮𝗳𝗿𝗮𝗺𝗲 has a non-missing value. With 𝗼𝘃𝗲𝗿𝘄𝗿𝗶𝘁𝗲=𝗙𝗮𝗹𝘀𝗲, only missing values are replaced.
Here is an illustration 💡
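Here is a small sketch of both modes, using two made-up DataFrames that share the same index and columns. Note that `update()` modifies the DataFrame in place and returns nothing, which is why the example works on copies.

```python
import numpy as np
import pandas as pd

# Illustrative data: both frames share the same index and columns
first_dataframe = pd.DataFrame({"score": [10.0, np.nan, 30.0],
                                "age": [np.nan, 25.0, 27.0]})
second_dataframe = pd.DataFrame({"score": [11.0, 20.0, np.nan],
                                 "age": [23.0, 26.0, np.nan]})

# overwrite=False: only the missing values of the first frame are filled
filled_only = first_dataframe.copy()
filled_only.update(second_dataframe, overwrite=False)

# overwrite=True (default): every non-missing value of the second frame wins
overwritten = first_dataframe.copy()
overwritten.update(second_dataframe)

print(filled_only)
print(overwritten)
```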
From unstructured data to structured data
Data preprocessing is full of challenges 🔥
Imagine you have this data with candidate information in the following format:
‘𝗔𝗱𝗷𝗮 𝗞𝗼𝗻𝗲: 𝗵𝗮𝘀 𝗠𝗮𝘀𝘁𝗲𝗿 𝗶𝗻 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝘀 𝗮𝗻𝗱 𝗶𝘀 𝟮𝟯 𝘆𝗲𝗮𝗿𝘀 𝗼𝗹𝗱’
…
‘𝗙𝗮𝗻𝘁𝗮 𝗧𝗿𝗮𝗼𝗿𝗲: 𝗵𝗮𝘀 𝗣𝗵𝗗 𝗶𝗻 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝘀 𝗮𝗻𝗱 𝗶𝘀 𝟯𝟬 𝘆𝗲𝗮𝗿𝘀 𝗼𝗹𝗱’
Next, your task is to generate a table with the following information per candidate for further analysis:
✨ The first and last name
✨ Degree and field of study
✨ Age
🚨 Doing such a task can be daunting 🤯
✅ This is where the 𝘀𝘁𝗿.𝗲𝘅𝘁𝗿𝗮𝗰𝘁() function in Pandas can help!
It is a powerful text processing function that uses regular expressions to extract structured information from unstructured text.
Below is an illustration 💡
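The sketch below shows one way to do it, assuming every record follows the pattern "<first name> <last name>: has <degree> in <field> and is <age> years old". The column names and the regular expression are my own choices for this example; each named group in the pattern becomes a column of the result.

```python
import pandas as pd

# Assumed input format, one free-text record per candidate
candidates = pd.DataFrame({"info": [
    "Adja Kone: has Master in Statistics and is 23 years old",
    "Fanta Traore: has PhD in Statistics and is 30 years old",
]})

# Each named group (?P<name>...) becomes a column in the output
pattern = (
    r"(?P<first_name>\w+) (?P<last_name>\w+): "
    r"has (?P<degree>\w+) in (?P<field>\w+) "
    r"and is (?P<age>\d+) years old"
)

structured = candidates["info"].str.extract(pattern)
# Extracted values are strings, so convert the age to an integer
structured["age"] = structured["age"].astype(int)

print(structured)
```

Rows that do not match the pattern come back as missing values, so real data with a less regular format would need a more tolerant expression.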
Perform multiple aggregations with the agg() function
Sometimes you want to apply several aggregate functions, such as 𝘀𝘂𝗺, 𝗮𝘃𝗲𝗿𝗮𝗴𝗲 or 𝗰𝗼𝘂𝗻𝘁, to one or more columns.
✅ You can combine the 𝗴𝗿𝗼𝘂𝗽𝗯𝘆() 𝗮𝗻𝗱 𝗮𝗴𝗴() 𝗳𝘂𝗻𝗰𝘁𝗶𝗼𝗻𝘀 from Pandas in one line of code.
Here is a Scenario 🎬 👇🏽
Imagine you have student data that contains information about:
✨ Students’ fields of study
✨ Their grades
✨ The graduation year and age of each student.
And you have been asked to calculate the following information by field of study and by graduation year:
→ The number of students
→ The average grade
→ The average age
Below is an image illustration 💡 to solve the scenario.
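A possible solution to the scenario is sketched below with made-up student data. It uses named aggregation, where each keyword argument of `agg()` is an output column defined by a (source column, function) pair. The column names are assumptions for this example.

```python
import pandas as pd

# Illustrative student data
students = pd.DataFrame({
    "field": ["Statistics", "Statistics", "Computer Science",
              "Computer Science", "Statistics"],
    "graduation_year": [2021, 2021, 2021, 2022, 2022],
    "grade": [15.0, 17.0, 14.0, 16.0, 18.0],
    "age": [23, 25, 24, 26, 30],
})

# One groupby + agg call computes all three statistics per (field, year)
summary = (
    students.groupby(["field", "graduation_year"])
    .agg(
        n_students=("grade", "count"),
        avg_grade=("grade", "mean"),
        avg_age=("age", "mean"),
    )
    .reset_index()
)

print(summary)
```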
Select observations between two specified times
When working with time series data, you may want to select observations between two specified times for further analysis.
✅ This can be quickly achieved using the 𝗯𝗲𝘁𝘄𝗲𝗲𝗻_𝘁𝗶𝗺𝗲() function.
Below is an illustration 💡
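Here is a minimal sketch with made-up readings taken every 30 minutes. Keep in mind that `between_time()` only works when the DataFrame has a DatetimeIndex, and that both boundaries are included by default.

```python
import pandas as pd

# Illustrative time series: one reading every 30 minutes from 08:00 to 10:30
index = pd.date_range("2023-01-01 08:00", periods=6, freq="30min")
readings = pd.DataFrame({"value": range(6)}, index=index)

# Keep only the rows whose time of day falls between 09:00 and 10:00
morning = readings.between_time("09:00", "10:00")

print(morning)
```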
Check if all items meet a certain condition
❌ The combination of 𝗳𝗼𝗿 loops and 𝗶𝗳 statements is not always the most elegant way to write Python code.
For example, say you want to check if all items in an iterable meet a certain condition.
Two possibilities may arise:
1️⃣ Either use the for loop and the if statement.
OR
2️⃣ Use the all() built-in function
Below is an illustration 💡
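Both options can be sketched as follows, using a made-up list of scores and an arbitrary passing threshold of 10.

```python
scores = [12, 15, 18, 11]

# Option 1: for loop + if statement
all_passed_loop = True
for score in scores:
    if score < 10:
        all_passed_loop = False
        break  # one failure is enough, no need to look further

# Option 2: all() with a generator expression
all_passed = all(score >= 10 for score in scores)

print(all_passed_loop, all_passed)
```

Like the loop with `break`, `all()` stops at the first item that fails the condition.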
Check if at least one item meets a certain condition
Similar to the previous case, you may want to check if at least one element of an iterable meets a certain condition.
✅ In that case, use the any() built-in function, which is more elegant than a for loop combined with an if statement.
The illustration is similar to the image above.
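For completeness, here is the same kind of sketch for `any()`, again with made-up scores and a threshold of 10.

```python
scores = [12, 8, 18, 11]

# Option 1: for loop + if statement
has_failure_loop = False
for score in scores:
    if score < 10:
        has_failure_loop = True
        break  # stop as soon as one match is found

# Option 2: any() with a generator expression
has_failure = any(score < 10 for score in scores)

print(has_failure_loop, has_failure)
```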
Avoid nested for loops
Writing nested 𝗳𝗼𝗿 loops is almost inevitable as your program gets bigger and more complicated.
❌ It can also make your code difficult to read and maintain.
✅ A better alternative is to use the 𝗽𝗿𝗼𝗱𝘂𝗰𝘁() function from the built-in itertools module instead.
Below is an illustration 💡
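The sketch below compares the two approaches on two small made-up lists. `product()` lives in the standard library's `itertools` module and yields the same pairs, in the same order, as the nested loops.

```python
from itertools import product

degrees = ["Master", "PhD"]
fields = ["Statistics", "Computer Science"]

# Nested for loops
nested_pairs = []
for degree in degrees:
    for field in fields:
        nested_pairs.append((degree, field))

# Single loop over the Cartesian product
flat_pairs = []
for degree, field in product(degrees, fields):
    flat_pairs.append((degree, field))

print(flat_pairs)
```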
Automatically manage index in a list
Imagine you need to access the elements of a list and their indexes at the same time.
One way to do this is to manually manage indexes in a for loop.
✅ Instead, you can use the built-in function 𝗲𝗻𝘂𝗺𝗲𝗿𝗮𝘁𝗲().
This has two main advantages (that I can think of).
✨ First, it automatically handles the index variable.
✨ Second, it makes the code more readable.
Below is an illustration 💡
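Here is a short sketch of both versions, using a made-up list of names.

```python
names = ["Adja", "Fanta", "Moussa"]

# Manual index management
manual = []
index = 0
for name in names:
    manual.append((index, name))
    index += 1  # easy to forget or misplace

# enumerate() yields (index, item) pairs for you
automatic = []
for index, name in enumerate(names):
    automatic.append((index, name))

print(automatic)
```

If you need the count to start somewhere other than zero, `enumerate()` accepts a `start` argument, for example `enumerate(names, start=1)`.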
Thanks for reading! 🎉 🍾
I hope you found this list of Python and Pandas tips useful! Keep an eye on this space, as it will be updated with more daily tips.
Also, if you enjoy reading my stories and want to support my writing, consider becoming a Medium member. With a $5 per month commitment, you unlock unlimited access to stories on Medium.
Would you buy me a coffee ☕️? → Here you go!
Feel free to follow me on Medium, Twitter, and YouTube, or say hello on LinkedIn. It’s always a pleasure to discuss things about AI, ML, Data Science, NLP and MLOps!
Before you leave, find the previous parts of this series below:
Pandas & Python Tips for Data Science and Data Analytics – Part 1
Pandas & Python Tips for Data Science and Data Analytics – Part 2
Pandas and Python Tips for Data Science and Data Analytics – Part 3
Pandas and Python Tips for Data Science and Data Analytics – Part 4