โŒ

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

dataframe: count multiple occurrences across all columns and output dataframe with same columns, single occurrences as indexes

I have a Pandas dataframe like this:

>>> df = pd.DataFrame({'2012':['A','A','B','A'],'2013':['A','B','C','C'],'2014':['A','C','Z','C']})
>>> df

  2012 2013 2014
0    A    A    A
1    A    B    C
2    B    C    Z
3    A    C    C

From it, I need to create another dataframe like this:

   2012  2013  2014
A     3     1     1
B     1     1     0
C     0     2     2

where I am basically counting the occurrences of some of the labels (A, B, C but not Z) in every column, turning those labels into the index, and showing their count per year.

I did come out with a solution that involves iteration:

>>> indexes = ['A','B','C']
>>> for idx in indexes:
        df2.loc[idx] = (df == idx).sum()
>>> df2

   2012  2013  2014
A     3     1     1
B     1     1     0
C     0     2     2

This outputs exactly what I need. But I wonder, is there a way to do it in one shot without iteration?

I played around with value_counts(), pivot_table() and groupby() without success. All Google searches I found point to this type of count, but across one column only.
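
For reference, one vectorized sketch (based only on the example above, not an authoritative answer) that reproduces df2 without the explicit loop:

import pandas as pd

df = pd.DataFrame({'2012': ['A', 'A', 'B', 'A'],
                   '2013': ['A', 'B', 'C', 'C'],
                   '2014': ['A', 'C', 'Z', 'C']})
indexes = ['A', 'B', 'C']

# one value_counts() per column, restricted to the wanted labels; missing counts become 0
df2 = df.apply(pd.Series.value_counts).reindex(indexes).fillna(0).astype(int)
print(df2)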

Thanks in advance to whoever may help!

Remove observations within 7 days of each other within specific ID group

I have a pandas dataframe with ID and dates like this:

ID Date
111 16/09/2021
111 14/03/2022
111 18/03/2022
111 21/03/2022
111 22/03/2022
222 27/03/2022
222 30/03/2022
222 4/04/2022
222 6/04/2022
222 13/04/2022

For each ID, I would like to filter the table and remove observations that are within 7 days of each other. I want to keep the earliest of the dates that are within 7 days of each other, so that each ID ends up with unique dates that are more than 7 days apart and have no other dates in between:

ID Date
111 16/09/2021
111 14/03/2022
111 22/03/2022
222 27/03/2022
222 4/04/2022
222 13/04/2022

I'm quite new to Python and pandas dataframes, so I'm hoping someone can assist and provide some pointers. There is a similar SO question, How do I remove observations within 7 days of each other within a specific ID group?, but it was answered in R, so I'm hoping there is something similar that can be done with pandas.
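
For what it is worth, a minimal pandas sketch of that rolling-anchor logic (assuming the frame is called df and Date is in day-first format; a sketch, not a definitive answer):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['ID', 'Date'])

keep = []
for _, group in df.groupby('ID'):
    last_kept = None
    for idx, date in group['Date'].items():
        # keep a date only if it is more than 7 days after the last kept date of this ID
        if last_kept is None or (date - last_kept).days > 7:
            keep.append(idx)
            last_kept = date

result = df.loc[keep]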

python median based on aggregation and current record exclusion

I would like to add a new column containing the median of value1 within each key group, but with each record itself excluded from its own median calculation.

Below is an example of the data with what my desired output should be:

df = pd.DataFrame({ 'key': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], 'value1': [0.1, 0.244, 0.373, 0.514, 0.663, 0.786, 0.902, 1.01, 1.151, 1.295, 1.434, 1.541, 1.679, 1.793, 1.94, 2.049, 2.164, 2.284, 2.432, 2.533, 2.68, 2.786, 2.906, 3.008, 3.136], 'value2': ['Dept1', 'Dept2', 'Dept3', 'Dept4', 'Dept5', 'Dept6', 'Dept7', 'Dept8', 'Dept9', 'Dept10', 'Dept11', 'Dept12', 'Dept13', 'Dept14', 'Dept15', 'Dept16', 'Dept17', 'Dept18', 'Dept19', 'Dept20', 'Dept21', 'Dept22', 'Dept23', 'Dept24', 'Dept25'], 'desired_output':['0.589', '0.589', '0.589', '0.518', '0.444', '0.444', '0.444', '1.61', '1.61', '1.61', '1.61', '1.557', '1.488', '1.488', '1.488', '1.488', '2.733', '2.733', '2.733', '2.733', '2.66', '2.607', '2.607', '2.607', '2.607'] })

I have tried this code, which returns the error "Transform function invalid for data types":

df['desired_output2'] = df.groupby(['key'], as_index = False)['value1'].transform(lambda x: [x[(v['value2'] != x['value2'])].median() for v in x])

I think a groupby with a looping lambda should work for this, but I can't get the syntax right.
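
A sketch of one way to express that leave-one-out median with transform (assuming the row labels are unique within each group, as in the example; untested beyond it):

import pandas as pd

# for each row, the median of the other value1 entries in the same key group
df['desired_output2'] = df.groupby('key')['value1'].transform(
    lambda s: [s.drop(i).median() for i in s.index]
)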

Get an average of the unique values of a column with group by condition and assign it to same table

I'm trying to get an average of the unique values of a column with a group by condition.

In the method I followed, I had to create a new data frame of unique values and then merge the results into the main data frame.

Is there a better way to use the pandas transform method to achieve my result?

In the reproducible example below, I am getting 1. the overall average of the unique counts values under each user, and 2. the average of the unique counts values under each user for each value in the Category column.

 df1 = pd.DataFrame({'user_id':['A','A','A', 'B','B','B', 'D','D','D', 'E','E'],
                'Category':['ABC','ABC','ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','XYZ'],
                'counts':[3,3,3,2,2,1,2,1,2,2,2]})

df1_unique = df1.drop_duplicates(["user_id","Category",'counts'], keep='first')

df1["Overall_average__Unique_Counts"] = df1_unique['counts'].mean()

df1_unique["Categorywise_average_Unique_counts"] = df1_unique.groupby(["Category"])['counts'].transform('mean')

df2 = pd.merge(df1,df1_unique[["user_id","Category","counts","Categorywise_average_Unique_counts"]],on=["user_id","Category","counts"],how='left')

This is the resulting table:

print(df2)
   user_id Category  counts  Overall_average__Unique_Counts  \
0        A      ABC       3                            1.83   
1        A      ABC       3                            1.83   
2        A      ABC       3                            1.83   
3        B      ABC       2                            1.83   
4        B      ABC       2                            1.83   
5        B      ABC       1                            1.83   
6        D      XYZ       2                            1.83   
7        D      XYZ       1                            1.83   
8        D      XYZ       2                            1.83   
9        E      XYZ       2                            1.83   
10       E      XYZ       2                            1.83   

    Categorywise_average_Unique_counts  
0                                 2.00  
1                                 2.00  
2                                 2.00  
3                                 2.00  
4                                 2.00  
5                                 2.00  
6                                 1.67  
7                                 1.67  
8                                 1.67  
9                                 1.67  
10                                1.67  
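
One possible way to avoid both the intermediate frame's extra column and the merge is to compute the means on the de-duplicated rows and map them back (a sketch using the same column names as above):

import pandas as pd

uniq = df1.drop_duplicates(['user_id', 'Category', 'counts'])

# overall mean of the unique counts
df1['Overall_average__Unique_Counts'] = uniq['counts'].mean()

# per-Category mean of the unique counts, mapped back onto every row of df1
df1['Categorywise_average_Unique_counts'] = df1['Category'].map(
    uniq.groupby('Category')['counts'].mean()
)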

How to create a column that calculates the difference between two columns for a series of rows that have matching values for two other columns

I have been trying to use dplyr, stringr, and grepl to create a new column in a dataset that has paired rows: the rows match on two columns and differ on two others, one of which is the column of interest for which I would like to calculate the difference.

The matching values for the sets of rows are Date and Time, and the unmatched values for the sets of rows are Context and IndividualID.

The new column that I want to create would hold the difference for IndividualID between the sets of rows that share the matching values mentioned above (Date, Time).

Here is how the dataset looks:

Date        Time  Context    SeriesNumber IndividualID
01/01/2000  8.00  Mixed       1           a, b, c, d, e, f
01/01/2000  8.00  NON_Mixed   1           d, e, f, g, h
01/01/2000  9.00  Mixed       2           c, d, e, f
01/01/2000  9.00  NON_Mixed   2           a, b, c, d
01/01/2000  9.00  NON_Mixed   2           e, f, g, h

So the new column would be IndividualID_Difference, and it would ONLY be calculated for the NON_Mixed contexts; for the Mixed ones it should stay the same.

Date        Time  Context   SeriesNumber  IndividualID       IndividualID_Difference
01/01/2000  8.00  Mixed      1            a, b, c, d, e, f   a, b, c, d, e, f  
01/01/2000  8.00  NON_Mixed  1            d, e, f, g, h      g, h
01/01/2000  9.00  Mixed      2            c, d, e, f         c, d, e, f
01/01/2000  9.00  NON_Mixed  2            a, b, c, d         a, b
01/01/2000  9.00  NON_Mixed  2            e, f, g, h         g, h

In the second row, d, e, f are removed because they are present in the matched row (the first row, with the same Time, Date and SeriesNumber but a different Context), which is the Mixed context; since they are repeated in the NON_Mixed row, they need to be removed.

In the fourth row, a, b are removed because they are present in the third row, which is the matched row with the same Time, Date and SeriesNumber and a different Context.

In the fifth row, e, f are removed because they are present in the third row, which is the matched row with the same Time, Date and SeriesNumber and a different Context.

The list of IndividualIDs is rather immense.

Thanks in advance in case you know how to come up with a solution for this tricky question!

How to add a column or change data in each group after using group by in Pandas?

I am now using pandas to handle some data. After I used groupby, the simplified DataFrame's format is [MMSI(Vessel_ID), BaseTime, Location, Speed, Course, ...].

I use

for MMSI, group in grouped_df:
    print(MMSI)
    print(group)

to print the data.

For example, one group of data is:

             MMSI         BaseDateTime       LAT        LON  SOG  COG
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6   

I want to add a column which is the time difference of two points.

Below is the Output I want

             MMSI         BaseDateTime       LAT        LON  SOG  COG   Time-diff
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   3.0(hours)
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6   Na

So I use the code below to try to get the result:

for MMSI, group in grouped_df:
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # When Time-diff >= 2hours split them into different group

I can use print to show group['GROUP'] and group['time-diff'], but the result is gone when I visit grouped_df again. There is a warning saying that my group in grouped_df is just a copy of a slice from a DataFrame, and it recommends using .loc[row_indexer, col_indexer] = value instead. But in this case I don't know how to use .loc to reach the specific [row, col].

At the very beginning, I tried to use

  grouped_df['new-time'] = grouped_df.shift(-1)['BaseDateTime']
  grouped_df.dropna()

But it shows

'DataFrameGroupBy' object does not support item assignment

Now my solution is to create an empty_df and then concatenate the groups in grouped_df step by step, like this:

df['time-diff'] = pd.Series(dtype='float64')
df['GROUP'] = pd.Series(dtype='int')
grouped_df = df.groupby('MMSI')
for MMSI, group in grouped_df:

    # ... as the same as the code above
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # ... as the same as the code above

    frames = [empty_df, group]
    empty_df = pd.concat(frames)

I am not satisfied with this solution but I didn't find a proper way to change the value in grouped_df.

I'm now trying to use the solution from [this question](make a change in each group after groupby) to handle the DataFrame before group by.

Can someone help me?
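
For reference, a vectorized sketch that computes the time difference and the GROUP counter on the whole frame at once, so nothing has to be written back into the groups (assuming the column names and 2-hour threshold used above; a sketch, not a definitive answer):

import numpy as np
import pandas as pd

df['BaseDateTime'] = pd.to_datetime(df['BaseDateTime'])
df = df.sort_values(['MMSI', 'BaseDateTime'])

# hours from each point to the next point of the same vessel (NaN for the last point)
next_time = df.groupby('MMSI')['BaseDateTime'].shift(-1)
df['time-diff'] = (next_time - df['BaseDateTime']).dt.total_seconds() / 3600

# start a new segment whenever the gap is >= 2 hours, counted per vessel
df['GROUP'] = df['time-diff'].fillna(np.inf).ge(2).groupby(df['MMSI']).cumsum()

Because this never iterates over grouped_df, there is no slice copy to warn about.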

select distinct values and exclude rows by condition

There is a table with columns category not null, sub_category default null.

  • category is always set
  • sub_category is optional

I need to select distinct category, sub_category from the table and

  • exclude the NULL sub_category row when the category also has non-NULL sub_category rows
  • keep the NULL row if category + <null> is the only combination for that category
category    sub_category  
girl_toys        fluffy
girl_toys        dolls
drawing          <null> -- keep it since only one null sub_category
paining          red_color
paining          <null> -- exclude since there is non null sub_category

expected result:

category    sub_category  
girl_toys        fluffy
girl_toys        dolls
drawing          <null> -- keep it since drawing has only one null sub_category
paining          red_color -- <null> record excluded since a non-null sub_category exists

How to calculate percentages with GROUP BY query over many different variables?

I have a source table containing many character columns and another final numeric column (named ASSISTANCE). Just for completeness, let's say there are about 30 character columns.

Here is an example of the source table. The final numeric variable ASSISTANCE, which each character variable is to be split by, is in the last column:

AGE_GROUP  GENDER_TYP  Employment_flag  Indigenous_flag  Remoteness   ASSISTANCE
0 to 7     Male        Yes              No               Very remote  1
8 to 12    Female      Yes              No               Metro        2
0 to 7     Female      No               No               Metro        2
13 to 18   Not stated  No               No               Rural        3
13 to 18   Not stated  No               No               Remote       3

(Note: I refer to Question_response below as a placeholder for any of the variables above.)

For each of those character columns (say, the character column Question_response, taking on values YES, NO, UNKNOWN), I would like to get the percentage within each of those values, split by the numeric ASSISTANCE (ranging from 1 to 3) variable.
The numerator would be the count of respondents who responded within each subgroup of the Question_response variable and the ASSISTANCE variable, and the denominator is the count of respondents within each subgroup of only the Question_response variable.

So here is what an example would look like:

Question_response  ASSISTANCE  count  char_pct
YES                1           10     0.2222
YES                2           20     0.4444
YES                3           15     0.3333
NO                 1            2     0.1666
NO                 2            4     0.3333
NO                 3            6     0.5
UNKNOWN            1            2     0.2
UNKNOWN            2            3     0.3
UNKNOWN            3            5     0.5

Now, there are a lot of these character variables, similar to Question_response. How would I go about doing this in a fast, automated way for all of the character variables, without copy-pasting the code and changing one variable each time? Below is the code I found for just one variable.

Here is the code for just 1 iteration of one of the character variables.

select split.Question_response as Question_response
,split.ASSISTANCE as ASSISTANCE
,split.count as count
,(100.0 * split.count)/total.count as char_pct
from
(select Question_response, ASSISTANCE, count(*) as count
from source_table
group by Question_response, ASSISTANCE
) as split
join
(select Question_response, count(*) as count
from source_table
group by Question_response
) as total
on total.Question_response = split.Question_response
order by Question_response, ASSISTANCE

An alternative piece of code for the above is:

select Question_response
,ASSISTANCE
,count(*) as count
,100.0 * count(*)/sum(count(*)) over (partition by Question_response) as char_pct
from source_table
group by Question_response, ASSISTANCE
order by Question_response, ASSISTANCE

I understand the logic here, but I would like a final table containing this split by ALL character variables.

This final table would look something like this. It stacks up all the tables that have been produced:

Variable           Variable_value  ASSISTANCE  count  char_pct
Question_response  YES             1           10     0.2222
Question_response  YES             2           20     0.4444
Question_response  YES             3           15     0.3333
Question_response  NO              1            2     0.1666
Question_response  NO              2            4     0.3333
Question_response  NO              3            6     0.5
Question_response  UNKNOWN         1            2     0.2
Question_response  UNKNOWN         2            3     0.3
Question_response  UNKNOWN         3            5     0.5
Language_spoken    ENGLISH         1           10     0.2222
Language_spoken    ENGLISH         2           20     0.4444
Language_spoken    ENGLISH         3           15     0.3333
Language_spoken    NON_ENGLISH     1            2     0.1666
Language_spoken    NON_ENGLISH     2            4     0.3333
Language_spoken    NON_ENGLISH     3            6     0.5
...                ...             ...         ...    ...

Where Language_spoken is another character variable from source_table.
For more clarity, the first char_pct calculation is done by 10/(10+20+15) = 0.2222

Thanks so much!!!!

Calculating the mean based on two dataframes R

I have a df1 with variables v1, v2 and v3. In df2 I have the variables ID, Days, Hours and BT. For each ID I have 7 Days, and for each Day I have 24 Hours. For each row of df1, I want to look up the values of v1, v2 and v3 in the ID column of df2, add the BT values of v1, v2 and v3 for each Day and Hour combination, and take their mean.

Suppose df1 has the following values:

v1 v2 v3
1 3 2
1 2 2
1 4 2

Table2:

ID Days Hours BT
1 1 00 18
3 1 00 20
2 1 00 17

From Table 2 I am assuming the following criteria for the first row of df1

If I have BT values for each combination of v1, v2, and v3 for Day 1, Hour 00, let's say:

BT(v1=1,  Day=1, Hour=00) = 18
BT(v2=3,  Day=1, Hour=00) = 20
BT(v3=2,  Day=1, Hour=00) = 17

It will give me a value of 18.3 (the mean of 18, 20 and 17).

Resultant Table

v1 v2 v3 Days Hours Mean_BT
1 3 2 1 0 18.3
1 3 2 1 1 16.6
1 3 2 1 2 17.0
1 3 2 1 3 15.0
1 3 2 1 4 9.3
1 3 2 1 5 12.3

and so on for 840 observations.

Below is the code to generate table 1 and table 2.

set.seed(0)

table1 <- data.frame(
  v1 = sample(1:5, 5, replace = TRUE),
  v2 = sample(1:5, 5, replace = TRUE),
  v3 = sample(1:5, 5, replace = TRUE)
)    
# Table 2
table2 <- expand.grid(
  ID = 1:5,
  Days = 1:7,
  Hours = 0:23
)

table2$BT <- sample(1:25, nrow(table2), replace = TRUE)

# Displaying Table 2
print("Table 2:")
print(table2)

Fill in the value in the column based on the condition in another column

I have this dataframe:

data={"ID":[1,1,1,1,1,1,1,1,1,2,2,2],
"Year":[2000,2001,2002,2003,2004,1997,1998,2003,2004,1997,1998,2005],
"Firm":["A","A","B","B","A","A","A","A","B","B","A","A"],
"Count":[0,1,0,0,0,0,0,0,0,0,0,0]}

df1=pd.DataFrame(data)

The expected output is this.

data={"ID":[1,1,1,1,1,1,1,1,1,2,2,2],
"Year":[2000,2001,2002,2003,2004,1997,1998,2003,2004,1997,1998,2005],
"Firm":["A","A","B","B","A","A","A","A","B","B","A","A"],
"Count":[0,1,0,0,0,0,0,0,0,0,0,0],
"Count_1":[0,1,1,1,1,0,0,1,1,0,0,0]}

df2=pd.DataFrame(data)

I can achieve the expected output with the code below.

df_1=df1.sort_values(by=["ID","Year"],ascending=True)
df_1["Count_1"]=np.where(df_1["Count"]==1,1,np.NaN)
df_1["Count_1"]=df_1.groupby(["ID"],as_index=None)["Count_1"].ffill()
df_1.drop(columns=["Count"],inplace=True)
df_1.fillna(0)

However, I am looking for shorter and cleaner code.
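
A shorter variant, sketched from the expected output (reading it as "once Count is 1 for an ID, keep it 1 for all later Years of that ID"):

import pandas as pd

df1['Count_1'] = (
    df1.sort_values(['ID', 'Year'])
       .groupby('ID')['Count']
       .cummax()   # once a 1 appears for an ID, every later year stays 1
)

The result aligns on the original index, so df1 keeps its row order.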

Why is pandas.Series.std() different from numpy.std()?

This is what I am trying to explain:

>>> a = pd.Series([7, 20, 22, 22])
>>> a.std()
7.2284161474004804
>>> np.std(a)
6.2599920127744575

There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean(a)
17.75
>>> np.std(a)
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

However, this works:

for id, group in df.groupby('restaurant_id'):
  print id, np.std(group['price'])

Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?
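
For reference, the mismatch is the ddof default: pandas std() uses the sample standard deviation (ddof=1), while np.std() defaults to the population version (ddof=0). Passing ddof explicitly reconciles the two while keeping the groupby aggregation, e.g.:

>>> a.std(ddof=0)
6.2599920127744575
>>> df.groupby('restaurant_id')['price'].std(ddof=0)
restaurant_id
10407    6.259992
Name: price, dtype: float64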

Why does groupby.transform function take a dataframe as the input for the last function call in the first group loop?



import pandas as pd

# Generating a test dataframe
def get_data(stack=False):
    dic = {'APPL':[10,5,-4,8],'MSFT':[20,-9,12,4],'TSLA':[-8,15,-7,14]}
    df = pd.DataFrame(dic,index=pd.Series([2010,2011,2012,2013],name='date'))
    if stack:
        df = pd.DataFrame(df.stack()).reset_index()
        df.columns =['date','symbol','ret']
    return df

# Setting examples
dfs = get_data(stack=True).copy()


def f(x):
    global i
    print("-----------------------")
    print("Function call:", i,'\n')
    print("This is an input of function\n",x,'\n')
    i = i + 1

i = 1
dfs.groupby('date').transform(f)

I wanted to see what the inputs to a user-defined function are when I use the groupby transform method. As expected, the first two inputs are the columns (symbol, ret) of the first group (2010). But I don't see why the third input is the entire dataframe of the first group. So the total number of function calls is 9, not 8 (= 4 groups times 2 columns). I got the following result.

---- stacked data ---- 
     date symbol  ret
0   2010   APPL   10
1   2010   MSFT   20
2   2010   TSLA   -8
3   2011   APPL    5
4   2011   MSFT   -9
5   2011   TSLA   15
6   2012   APPL   -4
7   2012   MSFT   12
8   2012   TSLA   -7
9   2013   APPL    8
10  2013   MSFT    4
11  2013   TSLA   14

======= Transform method ========
-----------------------
Function call: 1 

This is an input of function
 0    APPL
1    MSFT
2    TSLA
Name: symbol, dtype: object 

-----------------------
Function call: 2 

This is an input of function
 0    10
1    20
2    -8
Name: ret, dtype: int64 

-----------------------
Function call: 3 

This is an input of function
   symbol  ret
0   APPL   10
1   MSFT   20
2   TSLA   -8 

-----------------------
Function call: 4 

This is an input of function
 3    APPL
4    MSFT
5    TSLA
Name: symbol, dtype: object 

-----------------------
Function call: 5 

This is an input of function
 3     5
4    -9
5    15
Name: ret, dtype: int64 

-----------------------
Function call: 6 

This is an input of function
 6    APPL
7    MSFT
8    TSLA
Name: symbol, dtype: object 

-----------------------
Function call: 7 

This is an input of function
 6    -4
7    12
8    -7
Name: ret, dtype: int64 

-----------------------
Function call: 8 

This is an input of function
 9     APPL
10    MSFT
11    TSLA
Name: symbol, dtype: object 

-----------------------
Function call: 9 

This is an input of function
 9      8
10     4
11    14
Name: ret, dtype: int64 

TSQL GroupBy value every X seconds

I have this query:

DECLARE @NUM AS INT --IT INDICATES THE NUMBER OF SECONDS TO GROUP THE DATA BY
DECLARE @DATA_START AS DATETIME = DATEADD(yy, DATEDIFF(yy, 0, GETDATE()), 0)
DECLARE @DATA_END AS DATETIME = GETDATE()

SELECT AVG(D.VALORE)
FROM AA_V_DATI_MISURAZIONE D
WHERE D.STARTDATE >= @DATA_START AND D.ENDDATE <= @DATA_END

Now I need to extract the average of "D.Valore", grouped by every @NUM seconds.

Creating a variable which indicates if a target date is between two dates, but there are several rows per case

I am trying to create a new variable called positive which indicates that targ_date falls within the date range between st_date and en_date, but there are several rows per ID. I am trying to create something like the variable positive below. For instance, if the target date falls within a date range, that range should be assigned a 1, otherwise a 0. Here is code to reproduce what I need.

ID<-c("4674","4674","4674","4674","4674","4674","4674",   "25694","25694","25694","25694","25694","25694","25694", "92252","92252","92252","92252","92252") 

st_date<-c(NA,NA,"2022-05-09",NA,NA,"2013-07-01","2013-11-27", NA,NA,NA,NA,NA,"2011-04-14","2011-08-04", NA,NA,"2015-08-07","2017-08-18",NA) 

en_date<-c(NA,NA,"2022-06-10",NA,NA,"2013-08-01","2013-12-27", NA,NA,NA,NA,NA,"2011-06-13","2011-08-15", NA,NA,"2015-09-09","2017-09-18",NA) 

targ_date<-c("2022-05-15","2013-05-01",NA,NA,NA,"2013-06-30",NA, NA,NA,NA,NA,NA,"2011-06-13","2011-07-15", NA,NA,"2015-08-20","2017-08-23",NA) 

positive<-c(0,0,1,0,0,0,0, 0,0,0,0,0,1,0, 0,0,1,0,0) 

data<-data.frame(ID,st_date,en_date,targ_date,positive)

I have tried:

ds <- ds %>% 
group_by(ID) %>% 
mutate( between_any = as.numeric(inrange(targ_date, st_date, en_date)) ) %>% 
ungroup() 

but it does not do what I want.

Write a query to find the full names of customers who have rented sci-fi movies more than 5 times. Arrange these names in the alphabetical order

Hello, below is my code.

DB: https://dev.mysql.com/doc/sakila/en/sakila-structure.html

select concat(first_name,' ',last_name) from
customer where customer_id in (
select customer_id from (
select customer_id, count(rental_id) as num
from 
category 
inner join film_category using(category_id) 
inner join film using(film_id) 
inner join inventory using(film_id) 
inner join rental using (inventory_id)
where name='Sci-Fi'
group by customer_id, rental_id)
where num > 5)T)

When I execute it, I get the error below:

ERROR 1248 (42000) at line 2: Every derived table must have its own alias

Expected Outcome is "full names of customers who have rented sci-fi movies more than 5 times. Arrange these names in the alphabetical order"

Could you please let me know what mistake I am making?

How can I combine groupby and apply method for a loop function

I have a dataset with three columns: "unique id", "Period", "flag1". This dataset has around 800K+ rows with many unique IDs. I am generating a Flag2 as shown in the result (for one ID only). I would like to know:

  1. how I can combine the below loop with the groupby and apply methods for each unique ID, as shown in the sample dataset.

  2. as a loop over every row is very slow, whether there is any alternative method to get the same result for each unique ID with better performance.

Sample dataset: 
ID  Period   Flag1
1   Jan-21  0
2   Jan-21  0
1   Feb-21  1
2   Feb-21  0
1   Mar-21  1
2   Mar-21  0
1   Apr-21  0
2   Apr-21  0
1   May-21  1
2   May-21  0
1   Jun-21  0
2   Jun-21  0
1   Jul-21  0
2   Jul-21  1
1   Aug-21  0
2   Aug-21  0
1   Sep-21  1
2   Sep-21  0
1   Oct-21  0
2   Oct-21  1
1   Nov-21  1
2   Nov-21  0
1   Dec-21  1
2   Dec-21  1
1   Jan-22  0
2   Jan-22  1
1   Feb-22  0
2   Feb-22  0
1   Mar-22  0
2   Mar-22  0
1   Apr-22  1
2   Apr-22  0
1   May-22  0
2   May-22  0
1   Jun-22  0
2   Jun-22  0
1   Jul-22  0
2   Jul-22  0
1   Aug-22  0
2   Aug-22  0
1   Sep-22  0
2   Sep-22  0
1   Oct-22  0
2   Oct-22  0
1   Nov-22  0
2   Nov-22  0
1   Dec-22  0
2   Dec-22  1
1   Jan-23  0
2   Feb-23  0
1   Mar-23  0
2   Apr-23  0
1   May-23  0
2   May-23  0
1   Jun-23  0

Result:
Id Period Flag1 Flag2(Out)
1 Jan-21   0     0
1 Feb-21   1     1
1 Mar-21   0     1
1 Apr-21   0     1
1 May-21   1     1
1 Jun-21   0     1
1 Jul-21   1     1
1 Aug-21   1     1
1 Sep-21   0     1
1 Oct-21   1     1
1 Nov-21   1     1
1 Dec-21   0     1
1 Jan-22   0     1
1 Feb-22   0     0
1 Mar-22   0     0
1 Apr-22   1     1
1 May-22   0     1
1 Jun-22   0     1
1 Jul-22   1     1
1 Aug-22   0     1
1 Sep-22   0     1
1 Oct-22   1     1
1 Nov-22   1     1
1 Dec-22   0     1
1 Jan-23   0     1
1 Feb-23   0     1
1 Mar-23   0     1
1 Apr-23   0     0
1 May-23   0     0
1 Jun-23   0     0

df['Flag2'] = 0
i = 0
while i < len(df):
    if df.loc[i, 'Flag1'] == 1:
        df.loc[i:i+11, 'Flag2'] = 1
        i += 11
    i += 1

I am trying to use the groupby and apply methods for the above loop, but I am getting an error.
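
A sketch of one way to push that window logic into groupby (assuming the rows are already in chronological order within each ID, as in the sample; not a definitive answer):

import numpy as np
import pandas as pd

def extend_flag(flag1):
    vals = flag1.to_numpy()
    out = np.zeros(len(vals), dtype=int)
    i = 0
    while i < len(vals):
        if vals[i] == 1:
            out[i:i + 12] = 1   # the flagged row plus the next 11 rows
            i += 12
        else:
            i += 1
    return out

df['Flag2'] = df.groupby('ID')['Flag1'].transform(extend_flag)

Working on a plain NumPy array inside the function also avoids the row-by-row .loc writes, which is usually what makes a loop over 800K rows slow.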

MySQL - Use of Group BY clause

I am using the Sakila Database for practice.

I am calculating Sales by Store: it provides a list of total sales broken down by store, incorporating data from the city, country, payment, rental, inventory, store, address, and staff tables.

I have written a query, but it is throwing an error because a non-aggregated column is not in the GROUP BY clause. However, the view's DDL statement also does something similar but does not give any error. If anyone can explain why this is happening, it would be very useful.

DDL Statement for the View (Which is working fine):

SELECT 
    CONCAT(`c`.`city`, _utf8mb4 ',', `cy`.`country`) AS `store`,
    CONCAT(`m`.`first_name`, _utf8mb4 ' ', `m`.`last_name`) AS `manager`,
    SUM(`p`.`amount`) AS `total_sales` 
FROM 
    (((((((`sakila`.`payment` `p` 
    INNER JOIN `sakila`.`rental` `r` ON (`p`.`rental_id` = `r`.`rental_id`)
    ) 
    INNER JOIN `sakila`.`inventory` `i` ON (`r`.`inventory_id` = `i`.`inventory_id`)
    ) 
    INNER JOIN `sakila`.`store` `s` ON (`i`.`store_id` = `s`.`store_id`)
    ) 
    INNER JOIN `sakila`.`address` `a` ON (`s`.`address_id` = `a`.`address_id`)
    ) 
    INNER JOIN `sakila`.`city` `c` ON (`a`.`city_id` = `c`.`city_id`)
    ) 
    INNER JOIN `sakila`.`country` `cy` ON (`c`.`country_id` = `cy`.`country_id`)
    ) 
    INNER JOIN `sakila`.`staff` `m` ON (`s`.`manager_staff_id` = `m`.`staff_id`)
)
GROUP BY `s`.`store_id`
ORDER BY `cy`.`country`, `c`.`city`;

My Query -

SELECT 
    cont.country, 
    SUM(pay.amount) as "Total Sales"
FROM payment as pay
INNER JOIN staff as staff ON pay.staff_id = staff.staff_id 
INNER JOIN store as store ON staff.store_id = store.store_id
INNER JOIN address as ad ON store.address_id = ad.address_id
INNER JOIN city as City ON ad.city_id = city.city_id
INNER JOIN country as cont ON city.country_id = cont.country_id
INNER JOIN rental as rent ON pay.rental_id = rent.rental_id
INNER JOIN inventory as inven ON rent.inventory_id = inven.inventory_id
GROUP BY inven.store_id;

The desired output is revenue by country, but when I group by country the output comes out wrong; I have to group by store_id as in the DDL statement.

Desired output

Resample and aggregation functions together

I have a dataframe like the one below; it is tick data from a stock exchange.

                      price  quantity
date_time                            
2023-07-21 10:00:02  170.41        71
2023-07-21 10:00:20  170.68       200
2023-07-21 10:00:31  170.76        23
2023-07-21 10:00:51  170.44       139
2023-07-21 10:01:36  170.41         2
2023-07-21 10:01:17  170.48        42
2023-07-21 10:01:22  170.45         1
2023-07-21 10:01:41  170.10        10

As a result I want to get (I think step form describes it better):

  1. resample the data into 5-minute intervals
  2. group 'quantity' by 'price' within each interval
  3. find the max 'quantity' and its index (some exact datetime value)
  4. create a dataframe like the one in the table below
| date_time_resampled | date_time                |price|quantity|
|2023-07-21 10:00:00  | some exact datetime value|     |        |
|2023-07-21 10:05:00  | 2023-07-21 10:00:31      |     |        |
|2023-07-21 10:10:00  | some exact datetime value|     |        |

I tried the resample and groupby functions but I can't combine them together.
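
A rough sketch of steps 1-4 with pd.Grouper, under one reading of step 3 (the price level with the largest summed quantity in each 5-minute bin, plus the timestamp of its largest single trade; assumes date_time is a DatetimeIndex and the frame is called df):

import pandas as pd

rows = []
for bucket, chunk in df.groupby(pd.Grouper(freq='5min')):
    if chunk.empty:
        continue
    by_price = chunk.groupby('price')['quantity'].sum()    # step 2
    best_price = by_price.idxmax()                          # step 3
    at_best = chunk[chunk['price'] == best_price]
    rows.append({
        'date_time_resampled': bucket,                      # step 1
        'date_time': at_best['quantity'].idxmax(),          # exact tick time
        'price': best_price,
        'quantity': by_price.max(),
    })

result = pd.DataFrame(rows)                                 # step 4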

โŒ
โŒ