Remove rows of one DataFrame based on one column of another DataFrame

I have two DataFrames and want to remove the rows in df1 whose value in column ‘a’ also appears in column ‘a’ of df2. Moreover, each occurrence of a value in df2 should remove only one row from df1.

df1=pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})

df2=pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})

result=pd.DataFrame({'a':[1,1,3,4],'b':[1,2,4,6],'c':[6,5,3,1]})
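
One possible way to get this result, as a minimal sketch rather than the definitive answer: number the repeated values of ‘a’ in each frame with groupby(...).cumcount(), then do a left merge on the value plus that occurrence number and keep only the rows that found no partner. The helper column name occ is made up for illustration, and the sketch assumes that the first matching rows in df1 are the ones to drop, which is what the expected result suggests.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 2, 3, 4, 4], 'b': [1, 2, 3, 4, 5, 6], 'c': [6, 5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'a': [2, 4, 2], 'b': [1, 2, 3], 'c': [6, 5, 4]})

# Number each repeated value of 'a' within its own frame:
# 0 for the first occurrence, 1 for the second, and so on.
# 'occ' is a hypothetical helper column, not part of the original data.
left = df1.assign(occ=df1.groupby('a').cumcount())
right = df2[['a']].assign(occ=df2.groupby('a').cumcount())

# A left merge with indicator=True marks every df1 row as 'both' (it has a
# partner in df2 and should be dropped) or 'left_only' (it should be kept).
merged = left.merge(right, on=['a', 'occ'], how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only', ['a', 'b', 'c']].reset_index(drop=True)

print(result)
```

Because each (a, occ) pair is unique in both frames, one occurrence in df2 can cancel at most one row in df1; the second 2 in df2 simply finds nothing left to remove.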

Author: Sarim Nabi

Delete duplicated words in the same row in Pandas

I’m pretty new to Python, pandas, and programming in general. I have a DataFrame that looks something like this (just a simplified example):

       A      B
0  name1  Dog, Dog, Cat
1  name2  Dog, Bird
2  name3  Cat, Cat, Cat
3  name4  Dog, Cat, Bird

I want to delete the duplicated values on each row, so my DataFrame looks like this:

       A      B  
0  name1  Dog, Cat
1  name2  Dog, Bird
2  name3  Cat
3  name4  Dog, Cat, Bird

I saw that I can do something like that with OrderedDict (from collections import OrderedDict), but as I said, I’m pretty new to programming and have no idea how to do that. It would be great if you could help me, thank you!
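
A minimal sketch of one way to do this, assuming the values in column B are always separated by a comma and a space. It uses a plain dict instead of OrderedDict, since regular dicts keep insertion order in Python 3.7 and later. The function name dedupe_words and its sep parameter are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['name1', 'name2', 'name3', 'name4'],
    'B': ['Dog, Dog, Cat', 'Dog, Bird', 'Cat, Cat, Cat', 'Dog, Cat, Bird'],
})


def dedupe_words(cell, sep=', '):
    """Drop repeated items from a separated string, keeping first occurrences."""
    # dict.fromkeys builds a dict whose keys are the items in their original
    # order; a repeated item only overwrites its own key, so duplicates vanish.
    return sep.join(dict.fromkeys(cell.split(sep)))


# Apply the helper to every cell of column B.
df['B'] = df['B'].apply(dedupe_words)

print(df)
```

If the separators are not uniform (for example "Dog,Cat" without the space), the cells would need to be split on "," and stripped before deduplicating.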

Author: David Masnou Sánchez

Which Python data structure is best to grow dynamically for my use case?

I am writing a tool to analyse test reports and create a summary report of the data values they contain.
From several reports in this format (there could be thousands of these CSV files in the end)…

TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000059;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132547;;
TestCase1;Param1;0;100;92;mV;Pass
TestCase1;Param2;0;100;0;mV;Pass

TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000060;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132722;;
TestCase2;Param1;0;100;130;mV;Fail
TestCase2;Param2;0;100;12;mV;Pass

TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000061;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132921;;
TestCase1;Param1;0;100;93;mV;Pass
TestCase1;Param2;0;100;1;mV;Pass
TestCase2;Param1;0;100;131;mV;Fail
TestCase2;Param2;0;100;13;mV;Pass

…my code should create just one summary in this format with one line per processed report:

TestRun_Serial;TestRun_Date;TestRun_StartTime;TestCase1_Param1;TestCase1_Param2;TestCase2_Param1;TestCase2_Param2
000059;20200220;132547;92;0;na;na
000060;20200220;132722;na;na;130;12
000061;20200220;132921;93;1;131;13

One important thing to know is that neither the test case names nor the param names are fixed. This means that when iterating over the CSV files I will come across new test case names and new param names that were not part of a previously processed report. So, each time I find a new test case/param combination, I would like to extend the data structure with an additional ‘column’.

In my current code I’m reading the report files into pandas DataFrames to access the needed values for each report. My main question now: what is a suitable data structure to collect the data in and write it to a file at the end? I was thinking of another pandas DataFrame, but learned that growing a DataFrame dynamically is a very bad idea from a performance perspective.
So what would be the preferable approach here instead? Is there something like a dictionary with several values for the same key?
Here is the relevant snippet of my current code. How should I continue?

import csv
import glob

import pandas as pd

with open(report_summary_path, "w", newline='') as summary_file:
    summary_writer = csv.writer(summary_file, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for report_file in glob.glob(pathname_pattern):
        with open(report_file) as current_report:
            # read from each report file into pandas DataFrame
            df_in = pd.read_csv(current_report, delimiter=';', header=0,
                                names=['TEST CASE', 'PARAMETER', 'MIN', 'MAX', 'VALUE', 'UNIT', 'RESULT'],
                                index_col=['TEST CASE', 'PARAMETER'], usecols=['TEST CASE', 'PARAMETER', 'VALUE'])

            for idx, data in df_in.groupby(level='PARAMETER'):
                test_case = f'{data.index.values[0][0]}'
                test_param = f'{data.index.values[0][1]}'
                data_value = data.values[0][0]
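
One possible structure, sketched here under a few assumptions: keep a list with one dict per report (column name to value) and a second dict used as an insertion-ordered set of all column names seen so far. New test case/param combinations then cost nothing to add, and csv.DictWriter can fill the gaps with "na" when writing. The sketch reads the sample reports from strings so that it runs on its own, and it uses the csv module instead of pandas for parsing, which also keeps leading zeros such as 000059 intact. The function name summarize, the REPORTS list, and the rule of joining test case and parameter with "_" and removing spaces (inferred from the expected header) are all assumptions.

```python
import csv
import io

# The three sample reports from the question, inlined for a runnable sketch.
REPORTS = [
    """\
TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000059;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132547;;
TestCase1;Param1;0;100;92;mV;Pass
TestCase1;Param2;0;100;0;mV;Pass
""",
    """\
TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000060;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132722;;
TestCase2;Param1;0;100;130;mV;Fail
TestCase2;Param2;0;100;12;mV;Pass
""",
    """\
TEST CASE;PARAMETER;MIN;MAX;VALUE;UNIT;RESULT
TestRun;Serial;;;000061;;
TestRun;Date;;;20200220;;
Test Run;Start Time;;;132921;;
TestCase1;Param1;0;100;93;mV;Pass
TestCase1;Param2;0;100;1;mV;Pass
TestCase2;Param1;0;100;131;mV;Fail
TestCase2;Param2;0;100;13;mV;Pass
""",
]


def summarize(report_texts):
    """Return (column names in first-seen order, one dict per report)."""
    rows = []      # one dict per report: column name -> value
    columns = {}   # dict used as an insertion-ordered set of column names
    for text in report_texts:
        row = {}
        for rec in csv.DictReader(io.StringIO(text), delimiter=';'):
            # Assumed naming rule: "<test case>_<parameter>" without spaces.
            name = f"{rec['TEST CASE']}_{rec['PARAMETER']}".replace(' ', '')
            row[name] = rec['VALUE']
            columns[name] = None
        rows.append(row)
    return list(columns), rows


columns, rows = summarize(REPORTS)

# Write once at the end; restval fills columns a report did not contain.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, delimiter=';',
                        restval='na', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
summary = out.getvalue()
print(summary)
```

In the real tool, io.StringIO(text) would be replaced by the file object opened for each path from glob.glob(pathname_pattern), and out by the summary file. If a DataFrame is still wanted at the end, building it once from the finished list of dicts avoids the repeated growing that causes the performance problem.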

Author: SeBASStian