โŒ

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Using IF THEN function to calculate a rolling number in a dataframe

I am in need of help trying to create a calculated metric. I am trying to create an RSI calculation for a stock dataset. To do so I want to look at the last 14 days and find the average gain for all days where the stock price was up. I then will have to do the same for all days the stock market is down. I am doing these calculations across multiple stocks, so I created a dictionary which I then concatenate. Here is the code:

stocklist=["^SPX", "^DJI"]
d={}
def averageGain14(dailyGain):
    if dailyGain>= 0:
        gain = dailyGain
    return gain
for name in stocklist:
    d[name]= pd.DataFrame()
    data = yf.Ticker(name)
    data = data.history(start=myStart, end=myEnd)
    d[name]= pd.DataFrame(data)
    d[name]["Daily Gain"]=d[name]["Close"].diff()
    d[name]['Average Gain'] = d[name]["Daily Gain"].apply(averageGain14)
    d[name] = d[name].add_prefix(name)
modelData = pd.concat(d.values(), axis=1)

As you can see, I try to define a function averageGain14 at the top, which currently does nothing but return the gain value if the day was up (step 1 of getting this working). In the for loop, I try to set the "Average Gain" column to a calculated field that applies the function to the "Daily Gain" column, but I seem to be running into an error.

I tried a few approaches, but to no avail. First I tried d[name]['Average Gain'] = d[name].rolling(14).mean().where(d[name]['Daily Gain'] >= 0, 0)

That returned an error saying the Daily Gain value is a list and not a single value. I then tried appending .values to the Daily Gain call, but that didn't work either. I then tried the approach above, which is also not working. To add complexity, I need this to be a rolling average based on the last 14 days: not only add up the positive days, but also find the average gain for those days (i.e. know the denominator of how many days were up in the 14-day window). Hopefully this makes sense and someone can point me in the right direction.
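
Not the asker's method, just a minimal sketch of the rolling step described above, assuming the "Daily Gain" column already exists. It averages gains over the up days in each 14-day window, whereas a textbook RSI divides the summed gains by the full window length:

import pandas as pd

def add_avg_gain(df, window=14):
    gains = df["Daily Gain"].clip(lower=0)                              # down days contribute 0
    up_days = (df["Daily Gain"] > 0).astype(int).rolling(window).sum()  # up days per window
    df["Average Gain"] = gains.rolling(window).sum() / up_days          # NaN until 14 rows exist
    return df

Inside the loop it would be called as d[name] = add_avg_gain(d[name]) before the add_prefix step.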

dataframe: count multiple occurrences across all columns and output dataframe with same columns, single occurrences as indexes

I have a Pandas dataframe like this:

>>> df = pd.DataFrame({'2012':['A','A','B','A'],'2013':['A','B','C','C'],'2014':['A','C','Z','C']})
>>> df

  2012 2013 2014
0    A    A    A
1    A    B    C
2    B    C    Z
3    A    C    C

From it, I need to create another dataframe like this:

   2012  2013  2014
A     3     1     1
B     1     1     0
C     0     2     2

where I am basically counting some (A,B,C but not Z) of the occurrences of the labels in every column, turning them into indexes, and showing their count per year.

I did come out with a solution that involves iteration:

>>> indexes = ['A','B','C']
>>> for idx in indexes:
        df2.loc[idx] = (df == idx).sum()
>>> df2

   2012  2013  2014
A     3     1     1
B     1     1     0
C     0     2     2

This outputs exactly what I need. But I wonder, is there a way to do it in one shot without iteration?

I played around with value_counts(), pivot_table() and groupby() without success. All the Google searches I found point to this type of count but across one column only.

Thanks in advance to whoever may help!
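
A minimal sketch, assuming df and the indexes list from above: value_counts applied column-wise does the counting in one shot, and reindex keeps only the labels of interest (dropping Z):

indexes = ['A', 'B', 'C']
df2 = (df.apply(pd.Series.value_counts)   # per-column label counts
         .reindex(indexes)                # keep A, B, C in that order, drop Z
         .fillna(0)
         .astype(int))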

Why does filtering based on a condition result in an empty DataFrame in pandas?

I'm working with a DataFrame in Python using pandas, and I'm trying to apply multiple conditions to filter rows based on temperature values from multiple columns. However, after applying my conditions and using dropna(), I end up with zero rows even though I expect some data to meet these conditions.

The goal is to compare each temperature with AmbientTemp + 40 °C and, if the value is more than this, replace it with NaN. Otherwise, keep the original value.

Here's a sample of my DataFrame and the conditions I'm applying:

data = {
    'Datetime': ['2022-08-04 15:06:00', '2022-08-04 15:07:00', '2022-08-04 15:08:00', 
                 '2022-08-04 15:09:00', '2022-08-04 15:10:00'],
    'Temp1': [53.4, 54.3, 53.7, 54.3, 55.4],
    'Temp2': [57.8, 57.0, 87.0, 57.2, 57.5],
    'Temp3': [59.0, 58.8, 58.7, 59.1, 59.7],
    'Temp4': [46.7, 47.1, 80, 46.9, 47.3],
    'Temp5': [52.8, 53.1, 53.0, 53.1, 53.4],
    'Temp6': [50.1, 69, 50.3, 50.3, 50.6],
    'AmbientTemp': [29.0, 28.8, 28.6, 28.7, 28.9]
}
df1 = pd.DataFrame(data)
df1['Datetime'] = pd.to_datetime(df1['Datetime'])
df1.set_index('Datetime', inplace=True)

Code:

temp_cols = ['Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6']
ambient_col = 'AmbientTemp'

condition = (df1[temp_cols].lt(df1[ambient_col] + 40, axis=0))

filtered_df = df1[condition].dropna()
print(filtered_df.shape)

Response:

(0, 99)

Problem:

Despite expecting valid data that meets the conditions, the resulting DataFrame is empty after applying the filter and dropping NaN values. What could be causing this issue, and how can I correct it?
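
The likely cause is that the boolean frame only covers temp_cols, so indexing df1 with it leaves AmbientTemp entirely NaN and dropna() then removes every row. A minimal sketch, assuming df1, temp_cols and ambient_col as defined above, that masks only the offending readings:

over_limit = df1[temp_cols].gt(df1[ambient_col] + 40, axis=0)
df1[temp_cols] = df1[temp_cols].mask(over_limit)   # readings above AmbientTemp + 40 become NaN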

Remove observations within 7 days of each other within specific ID group

I have a pandas dataframe with ID and dates like this:

ID Date
111 16/09/2021
111 14/03/2022
111 18/03/2022
111 21/03/2022
111 22/03/2022
222 27/03/2022
222 30/03/2022
222 4/04/2022
222 6/04/2022
222 13/04/2022

For each ID, I would like to filter the table and remove observations that are within 7 days of each other. But I want to keep the earliest date of the dates that are within 7 days of each other so that each ID will have unique dates that are more than 7 days apart and do not contain other dates in between:

ID Date
111 16/09/2021
111 14/03/2022
111 22/03/2022
222 27/03/2022
222 4/04/2022
222 13/04/2022

I'm quite new to Python and pandas DataFrames, so I'm hoping someone can assist and provide some pointers. There is a similar SO question, How do I remove observations within 7 days of each other within a specific ID group?, but it was done in R, so I'm hoping there is something similar that can be done with pandas.
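
A minimal sketch using a plain per-ID loop (reading "within 7 days" as 7 days or less): a date is kept only when it falls more than 7 days after the last kept date for that ID. A fully vectorised version is possible, but this keeps the logic explicit:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

kept_rows = []
for id_, grp in df.groupby('ID'):
    last_kept = None
    for d in grp['Date'].sort_values():
        if last_kept is None or d - last_kept > pd.Timedelta(days=7):
            kept_rows.append({'ID': id_, 'Date': d})   # keep the earliest date of each cluster
            last_kept = d

out = pd.DataFrame(kept_rows)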

List Dict to rows/columns in dataframe

I'm a complete beginner at Python, and I'm trying to convert a JSON payload into a dataframe.

So I've got data that looks like this:

data = {
    "results": [
        {"values": [{"name": "firstname","value": "Fname1"},{"name": "lastname","value": "Lname1"}]},
        {"values": [{"name": "firstname","value": "Fname2"},{"name": "lastname","value": "Lname2"}]},
        {"values": [{"name": "firstname","value": "Fname3"},{"name": "lastname","value": "Lname3"}]}
    ]
}

I want a dataframe that looks like this:

firstname lastname
Fname1 Lname1
Fname2 Lname2
Fname3 Lname3

I'm trying to figure out how to do this without doing something like a for loop or manually assigning column names. Thanks!

Kinda tried something like this:

df = pd.DataFrame(data["results"])

df[['values'][0]].apply(lambda x: x[0]['value'])

which results in:

Fname1
Fname2
Fname3

But I'm kinda stuck on what to do next.
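
A minimal sketch, assuming data as defined above: turning each "values" list into a {name: value} dict lets the DataFrame constructor infer the columns, so nothing is named by hand (the comprehension still iterates, but only once over the payload):

import pandas as pd

df = pd.DataFrame(
    [{pair["name"]: pair["value"] for pair in item["values"]}
     for item in data["results"]]
)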

how to run a chi-square goodness of fit test on fitting discrete probability distributions in Python

I am trying to test the fit of several probability distributions on my data and perform a maximum likelihood estimation and KS test on the fit of each probability distribution to my data. My code works for continuous probability distributions but not for discrete ones (because they do not have a .fit method). To handle the discrete distributions I am attempting to use a chi-square test, but I cannot get it to work.

import pandas as pd
import numpy as np
import scipy.stats as st
from scipy.stats import kstest

data = pd.read_csv(r'...\demo_data.csv')
data = pd.DataFrame(data)

#continuous variables
continuous_results = []

for cvar in ["cvar1", "cva2", "cvar3"]:
  #probability distributions
  cdistributions = [
      st.arcsine,
      st.alpha,
      st.beta,
      st.betaprime,
      st.bradford,
      st.burr,
      st.chi,
      st.chi2,
      st.cosine,
      st.dgamma,
      st.dweibull,
      st.expon,
      st.exponweib,
      st.exponpow,
      st.genlogistic,
      st.genpareto,
      st.genexpon,
      st.gengamma,
      st.genhyperbolic,
      st.geninvgauss,
      st.gennorm,
      st.f,
      st.gamma,
      st.invgamma,
      st.invgauss,
      st.invweibull,
      st.laplace,
      st.logistic,
      st.loggamma,
      st.loglaplace,
      st.loguniform,
      st.nakagami,
      st.norm,
      st.pareto,
      st.powernorm,
      st.powerlaw,
      st.rdist,
      st.semicircular,
      st.t,
      st.trapezoid,
      st.triang,
      st.tukeylambda,
      st.uniform,
      st.wald,
      st.weibull_max,
      st.weibull_min
  ]

  for distribution in cdistributions:
    try:
      #fit each probability distribution to each variable in the data
      pars = distribution.fit(data[cvar])
      mle = distribution.nnlf(pars, data[cvar])
      
      #perform ks test
      ks_result = kstest(data[cvar], distribution.cdf, args = pars)
      
      #create dictionary to store results for each variable/distribution
      result = {
        "variable": cvar,
        "distribution": distribution.name,
        "type": "continuous",
        "mle": mle,
        "ks_stat": ks_result.statistic,
        "ks_pvalue": ks_result.pvalue
      }
      continuous_results.append(result)
    except Exception as e:
      # assign 0 for error
      mle = 0

This code isn't running a chi-square test or KS test, and I am unsure how to fix it:

#discrete variables
discrete_results = []

for var in ["dvar1", "dvar2"]:
  #probability distributions
  distributions = [
      st.bernoulli, 
      st.betabinom, 
      st.binom,
      st.boltzmann,
      st.poisson,
      st.geom,
      st.nbinom,
      st.hypergeom,
      st.zipf,
      st.zipfian,
      st.logser,
      st.randint,
      st.dlaplace
  ]

  for distribution in distributions:
    try:
      #fit each probability distribution to each variable in the data
      distfit = getattr(st, distribution.name)
      chisq_stat, pval = st.chisquare(data[var], f = distfit.pmf(range(len(data[var]))))
      
      #perform ks test
      ks_result = kstest(data[var], distribution.cdf, args = distfit.pmf(range(len(data[var]))))
      
      #create dictionary to store results for each variable/distribution
      result = {
        "variable": var,
        "distribution": distribution.name,
        "type": "continuous",
        "chisq": chisq_stat,
        "pvalue": pval,
        "ks_stat": ks_result.statistic,
        "ks_pvalue": ks_result.pvalue
      }
      discrete_results.append(result)
    except Exception as e:
      # assign 0 for error
      chisq_stat = 0
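
For the discrete case there is no generic .fit(), so each distribution needs its own parameter estimate before the goodness-of-fit test. A minimal sketch for one case (Poisson), assuming data[var] holds non-negative integer counts; the other discrete distributions would each need their own estimator, which is why a single loop like the continuous one is harder here:

import numpy as np
import scipy.stats as st

counts = data[var].to_numpy()
lam = counts.mean()                                  # Poisson MLE for the rate

values, observed = np.unique(counts, return_counts=True)
expected = st.poisson.pmf(values, lam) * len(counts)
expected *= observed.sum() / expected.sum()          # chisquare needs matching totals

# ddof accounts (roughly) for the one estimated parameter
chisq_stat, pval = st.chisquare(f_obs=observed, f_exp=expected, ddof=1)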

Reading Data in Python using pandas

import pandas as pd
import sklearn
from sklearn.datasets import load_iris

# Loading data from a CSV file
data = pd.read_csv('D:/Projects/FLGRU_Model/FLDataset/01-12/DrDoS_LDAP.csv')
df = pd.read_csv(data)

# Performing data analysis
df.head()  # Display the first few rows
df.describe()  # Statistical summary of the data

What could be the problem with this code? It's not reading the data.

I was reading the data and the code is giving the following errors:

 DtypeWarning: Columns (85) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv('D:/Projects/FLGRU_Model/FLDataset/01-12/DrDoS_LDAP.csv')
Traceback (most recent call last):
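
The likely culprit is the second read_csv call: pd.read_csv already returns a DataFrame, so passing that DataFrame back into read_csv is what raises the traceback. A minimal sketch of the corrected read:

import pandas as pd

df = pd.read_csv('D:/Projects/FLGRU_Model/FLDataset/01-12/DrDoS_LDAP.csv',
                 low_memory=False)   # also avoids the mixed-type DtypeWarning

print(df.head())       # first few rows
print(df.describe())   # statistical summary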

How to remove or hide x-axis labels from a seaborn / matplotlib plot

I have boxplots and need to remove the x-axis labels ('user_type' and 'member_gender'). How do I do this given the format below?

sb.boxplot(x="user_type", y="Seconds", data=df, color = default_color, ax = ax[0,0], sym='').set_title('User-Type (0=Non-Subscriber, 1=Subscriber)')
sb.boxplot(x="member_gender", y="Seconds", data=df, color = default_color, ax = ax[1,0], sym='').set_title('Gender (0=Male, 1=Female, 2=Other)')
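
A minimal sketch, assuming the ax grid from the question: clearing the label on each subplot after the boxplots are drawn removes the 'user_type' / 'member_gender' text.

ax[0, 0].set_xlabel('')
ax[1, 0].set_xlabel('')
# to hide the tick labels instead: ax[0, 0].set_xticklabels([])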

Row based filter and aggregation in pandas python

I have two dataframes as below

df1:

data1 = {
    'Acc': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4],
    'indi_val': ['Val1', 'val2', 'Val_E', 'Val1_E', 'Val1', 'Val3', 'val2', 'val2_E', 'val22_E', 'val2_A', 'val2_V', 'Val_E', 'Val_A', 'Val', 'Val2', 'val7'],
    'Amt': [10, 20, 5, 5, 22, 38, 15, 25, 22, 23, 24, 56, 67, 45, 87, 88]
}
df1 = pd.DataFrame(data1)

df2:

data2 = {
    'Acc': [1, 1, 2, 2, 3, 4],
    'Indi': ['With E', 'Without E', 'With E', 'Without E', 'Normal', 'Normal']
}
df2 = pd.DataFrame(data2)

Based on these two dataframes I need to create the final output as below:

 AccNo   Indi      Amt
 1      With E      7
 1      Without E   90
 2      With E      47
 2      Without E   62
 3      Normal      225
 4      Normal      88

Logic:

  • With E: from df1, where the last two characters of indi_val equal "_E", then sum(Amt).
  • Without E: from df1, where the last two characters of indi_val do not equal "_E", then sum(Amt).
  • Normal: sum(Amt) without any filter on the df1 indi_val column.

I tried writing something as below:


def get_indi(row):
    listval = []
    if row['Indi'] == "With E":
        #print('A')
        df1.apply(lambda df1row: listval.append(df1row['amt'] if df1row['Acc']==row['Acc'] and df1row['indi_val'][-2:]=="_E" else 0))
    
    if row['Indi'] == "Without E":
        df1.apply(lambda df1row: listval.append(df1row['amt'] if df1row['Acc']==row['Acc'] and df1row['indi_val'][-2:]!="_E" else 0))
    
    if row['Indi'] == "Normal":
        df1.apply(lambda df1row: listval.append(df1row['amt']))
        
    return sum(listval)

# Apply the function to create the 'Indi' column in df1
df2['Amt'] = df2.apply(get_indi)

With the above code I am getting the following error:

get_loc raise KeyError(key) KeyError: 'Indi'
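
The KeyError comes from df2.apply(get_indi) defaulting to axis=0, so each column (not each row) is passed to the function; df2.apply(get_indi, axis=1) would address that. A minimal sketch following the stated logic, assuming df1 and df2 as defined above, that pre-aggregates instead of nesting apply calls:

import numpy as np

is_e = df1['indi_val'].str[-2:] == '_E'
with_e    = df1[is_e].groupby('Acc')['Amt'].sum()
without_e = df1[~is_e].groupby('Acc')['Amt'].sum()
normal    = df1.groupby('Acc')['Amt'].sum()

df2['Amt'] = np.where(df2['Indi'] == 'With E', df2['Acc'].map(with_e),
             np.where(df2['Indi'] == 'Without E', df2['Acc'].map(without_e),
                      df2['Acc'].map(normal)))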

Move specific column information to a new row under the current row

Consider this df:

data = { 'Name Type': ["Primary", "Primary", "Primary"],
         'Full Name': ["John Snow", "Daenerys Targaryen", "Brienne Tarth"], 
         'AKA': ["Aegon Targaryen", None, None],
         'LQAKA': ["The Bastard of Winterfell", "Mother of Dragons", None],
         'Other': ["Info", "Info", "Info"]}
df = pd.DataFrame(data)

I need to move AKAs and LQAKAs, if they are not None, below each Primary name, and also set the Name Type to AKA or LQAKA. If the value is None, no row should be created. There are many other columns, like the Other column, that should keep their info in the same row as the Primary name. So the expected result would be:

Name Type  Full Name                  Other
Primary    John Snow                  Info
AKA        Aegon Targaryen
LQAKA      The Bastard of Winterfell
Primary    Daenerys Targaryen         Info
LQAKA      Mother of Dragons
Primary    Brienne Tarth              Info
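
A minimal sketch, assuming df as defined above: melt AKA/LQAKA into their own rows, drop the None entries, and slot each one back in after the Primary row it belongs to.

import pandas as pd

tmp = df.rename_axis('person').reset_index()     # keep the original row number as a key

melted = (tmp.melt(id_vars='person', value_vars=['AKA', 'LQAKA'],
                   var_name='Name Type', value_name='Full Name')
             .dropna(subset=['Full Name']))      # None entries create no row

primary = tmp.drop(columns=['AKA', 'LQAKA'])

out = pd.concat([primary, melted])
out['Name Type'] = pd.Categorical(out['Name Type'],
                                  categories=['Primary', 'AKA', 'LQAKA'], ordered=True)
out = (out.sort_values(['person', 'Name Type'])  # Primary first, then its AKA/LQAKA rows
          .drop(columns='person')
          .reset_index(drop=True))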

Strange 'Currency Rates source not ready' forex_python error

I'm trying to understand the forex API through Python. The code that I am posting below worked for me on Friday, and I received all the conversion rates for the dates as desired. Strangely, when I run the code today, for some reason it says

Currency Rates source not ready.

Why is this happening?

from forex_python.converter import CurrencyRates
import pandas as pd

c = CurrencyRates()


df = pd.DataFrame(pd.date_range(start='8/16/2021 10:00:00', end='8/22/2021 11:00:00', freq='600min'), columns=['DateTime'])

def get_rate(x):
    try:
        op = c.get_rate('CAD', 'USD', x)
    except Exception as re:
        print(re)
        op=None
    return op

df['Rate'] = df['DateTime'].apply(get_rate)

Currency Rates Source Not Ready
Currency Rates Source Not Ready

df
Out[17]: 
              DateTime      Rate
0  2021-08-16 10:00:00  0.796374
1  2021-08-16 20:00:00  0.796374
2  2021-08-17 06:00:00  0.793031
3  2021-08-17 16:00:00  0.793031
4  2021-08-18 02:00:00  0.792469
5  2021-08-18 12:00:00  0.792469
6  2021-08-18 22:00:00  0.792469
7  2021-08-19 08:00:00  0.783967
8  2021-08-19 18:00:00  0.783967
9  2021-08-20 04:00:00  0.774504
10 2021-08-20 14:00:00  0.774504
11 2021-08-21 00:00:00       NaN
12 2021-08-21 10:00:00       NaN
13 2021-08-21 20:00:00       NaN
14 2021-08-22 06:00:00       NaN

How do I fix this issue? Is there a way to skip the NaN-producing calls altogether? I suspect the API only gives results for Monday to Friday from 10 am to 5 pm, so is there a way to request just those results?
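
A minimal sketch, assuming df and get_rate as defined above: querying only weekday timestamps leaves the weekend rows as NaN without ever hitting the "source not ready" error.

weekday = df['DateTime'].dt.dayofweek < 5            # Monday=0 ... Friday=4
df.loc[weekday, 'Rate'] = df.loc[weekday, 'DateTime'].apply(get_rate)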

Apply transformation only on string columns with Pandas, ignoring numeric data

So, I have a pretty large dataframe with 85 columns and almost 90,000 rows and I wanted to use str.lower() in all of them. However, there are several columns containing numerical data. Is there an easy solution for this?

> df

    A   B   C
0   10  John    Dog
1   12  Jack    Cat
2   54  Mary    Monkey
3   23  Bob     Horse

Then, after using something like df.applymap(str.lower) I would get:

> df

    A   B   C
0   10  john    dog
1   12  jack    cat
2   54  mary    monkey
3   23  bob     horse

Currently it's showing this error message:

TypeError: descriptor 'lower' requires a 'str' object but received a 'int'
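
A minimal sketch: selecting only the string (object-dtype) columns first means the numeric ones are never touched.

str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.lower())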

Pandas KeyError: value not in index

I have the following code:

df = pd.read_csv(CsvFileName)

p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)

p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)

It has always worked until the csv file doesn't have enough coverage (of all weekdays). For example, with the following .csv file,

DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532

I'll get the following error:

KeyError: "['5Thu' '7Sat'] not in index"

It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
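
A minimal sketch: reindexing the pivot's columns guarantees every weekday exists (missing ones filled with 0) before the selection and cast.

days = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=days, fill_value=0)
p[days] = p[days].astype(int)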

astype does not work for Pandas dataframe

I have a dataframe called messages where the data looks like

message             length  class
hello, Come here      16     A
hi, how are you       15     A 
what is it            10     B
maybe tomorrow        14     A

When I do

messages.dtypes

It shows me

class      object
message    object
Length      int64
dtype: object

Then I tried converting the message column to string type

messages['message'] = messages['message'].astype(str)
print messages.dtypes

It still shows me

class      object
message    object
Length      int64
dtype: object

What am I doing wrong? Why doesn't it convert to string?

Python version 2.7.9 on Windows 10
Pandas version 0.15.2
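
A minimal sketch showing why the output is expected rather than wrong: pandas (especially the 0.x series) stores strings under the generic object dtype, so astype(str) still reports object even though every element really is a str.

print(messages['message'].dtype)                      # object
print(messages['message'].map(type).value_counts())   # every element is a str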

How do i iterate over a pandas groupby object with 2 columns and access both columns?

So I am currently using pandas DataFrames to do data processing from a CSV file. There are two columns and 30k entries or so.

There are duplicates of the first column with different values in the second column. I need to access both the first column's string and the second column's string.

The first column is a link and the second column is a security flag, but there are many of the same links with different flags, which I have to be able to access so I can decide whether to keep or change the flag.

I have tried to iterate over the groupby object in pandas, but I can only access the value I group by through the first loop variable. I cannot use the second loop variable (the group) to get the second corresponding column.

CODE BELOW:

for name, group in grouped:
        link = str(name)
        flag = group['flag']
        print(flag.str.contains('no_flag'))

I expected to get either 'no_flag' or one of the other flags from the flag variable.
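
A minimal sketch, assuming grouped = df.groupby('link') on a DataFrame with columns 'link' and 'flag' (the column names are guesses from the question): inside the loop, group is a DataFrame, so the second column is available on it directly.

for link, group in grouped:
    flags = group['flag'].tolist()            # every flag seen for this link
    has_no_flag = 'no_flag' in flags          # example decision input
    print(link, flags, has_no_flag)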

How to find mean and median of census data using python

I am new to Python data analysis and I am working on census data. Here is an example of the data:

| geo_area   | Total | 2-5km | 5-10km |
|------------|-------|-------|--------|
| E02000001  | 5378  | 385   | 241    |
| E02000002  | 3238  | 474   | 394    |
| E02000003  | 5238  | 603   | 541    |
| E02000004  | 3113  | 354   | 277    |
| E02000005  | 4862  | 684   | 532    |
| E02000006  | 4271  | 676   | 408    |

Here the first column is the geographical area, the second column is the total distance travelled to work, and the remaining columns are distances between 2-5 km and 5-10 km. I excluded the other columns (10-20 km, 20-30 km, 30-40 km, 40-60 km, over 60 km, and finally 0 km (working from home)) because it was too long. Also, I've shown only 6 rows, but the data is over 7000 rows.

The values of columns 2-5 km and 5-10km are frequencies.

I would like to determine both the average and the median distance travelled for each geographical area.

I am not too sure if this is correct, but for the median, I wrote the following definition:

def median_calculator(df):
    full_list_of_numbers = []
    for col in df.columns:
        full_list_of_numbers.append(((df[col].cumsum() - (df[col].sum(axis=0)/2).T) < 0).sum())
    return full_list_of_numbers

I am struggling with the mean calculation and don't even know if the median above is a correct approach. Also, can I use df.groupby('geographical_areas').agg({'mean_distance': 'mean', 'median_distance': 'median'}), or is that incorrect?

Here is a link to the data if anyone is interested: it is the file census2021-ts058-msoa in the zip file (TS058 Distance travelled to work, under Work and Travel).

I appreciate the help, really.
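
A minimal sketch, assuming the data is in a DataFrame df with the columns shown and that each distance band is represented by a hypothetical midpoint (2-5km → 3.5, 5-10km → 7.5, and so on for the bands not shown): the mean is a frequency-weighted average of the midpoints, and the median is the band containing the middle observation of each row.

import pandas as pd

midpoints = {'2-5km': 3.5, '5-10km': 7.5}    # extend with the other distance bands
freq = df[list(midpoints)]

df['mean_distance'] = (freq * pd.Series(midpoints)).sum(axis=1) / freq.sum(axis=1)

def row_median_band(row):
    cum = row.cumsum()                       # cumulative frequency across bands
    return row.index[(cum >= row.sum() / 2).argmax()]

df['median_band'] = freq.apply(row_median_band, axis=1)

Since the data already has one row per geo_area, a groupby/agg step isn't needed once these per-row columns exist.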

Stacked bar chart from dataframe

Program

Here's a small Python program that gets tax data via the treasury.gov API:

import pandas as pd
import treasury_gov_pandas
# ----------------------------------------------------------------------
df = treasury_gov_pandas.update_records(
    url = 'https://api.fiscaldata.treasury.gov/services/api/fiscal_service/v1/accounting/dts/deposits_withdrawals_operating_cash')

df['record_date'] = pd.to_datetime(df['record_date'])

df['transaction_today_amt'] = pd.to_numeric(df['transaction_today_amt'])

tmp = df[(df['transaction_type'] == 'Deposits') &   ((df['transaction_catg'].str.contains('Tax'))   |   (df['transaction_catg'].str.contains('FTD')))   ]

The program is using the following library to download the data:

https://github.com/dharmatech/treasury-gov-pandas.py

Dataframe

Here's what the resulting data looks like:

>>> tmp.tail(20).drop(columns=['table_nbr', 'table_nm', 'src_line_nbr', 'record_fiscal_year', 'record_fiscal_quarter', 'record_calendar_year', 'record_calendar_quarter', 'record_calendar_month', 'record_calendar_day', 'transaction_mtd_amt', 'transaction_fytd_amt', 'transaction_catg_desc', 'account_type', 'transaction_type'])

       record_date                          transaction_catg  transaction_today_amt
371266  2024-04-03    DHS - Customs and Certain Excise Taxes                     84
371288  2024-04-03                  Taxes - Corporate Income                    237
371289  2024-04-03                   Taxes - Estate and Gift                     66
371290  2024-04-03       Taxes - Federal Unemployment (FUTA)                     10
371291  2024-04-03  Taxes - IRS Collected Estate, Gift, misc                     23
371292  2024-04-03              Taxes - Miscellaneous Excise                     41
371293  2024-04-03  Taxes - Non Withheld Ind/SECA Electronic                   1786
371294  2024-04-03       Taxes - Non Withheld Ind/SECA Other                   2315
371295  2024-04-03               Taxes - Railroad Retirement                      3
371296  2024-04-03          Taxes - Withheld Individual/FICA                  12499
371447  2024-04-04    DHS - Customs and Certain Excise Taxes                     82
371469  2024-04-04                  Taxes - Corporate Income                    288
371470  2024-04-04                   Taxes - Estate and Gift                     59
371471  2024-04-04       Taxes - Federal Unemployment (FUTA)                      8
371472  2024-04-04  Taxes - IRS Collected Estate, Gift, misc                    127
371473  2024-04-04              Taxes - Miscellaneous Excise                     17
371474  2024-04-04  Taxes - Non Withheld Ind/SECA Electronic                   1905
371475  2024-04-04       Taxes - Non Withheld Ind/SECA Other                   1092
371476  2024-04-04               Taxes - Railroad Retirement                      1
371477  2024-04-04          Taxes - Withheld Individual/FICA                   2871

The dataframe has data that goes back to 2005:

>>> tmp.drop(columns=['table_nbr', 'table_nm', 'src_line_nbr', 'record_fiscal_year', 'record_fiscal_quarter', 'record_calendar_year', 'record_calendar_quarter', 'record_calendar_month', 'record_calendar_day', 'transaction_mtd_amt', 'transaction_fytd_amt', 'transaction_catg_desc', 'account_type', 'transaction_type'])

       record_date                                   transaction_catg  transaction_today_amt
2       2005-10-03                   Customs and Certain Excise Taxes                    127
7       2005-10-03                              Estate and Gift Taxes                     74
10      2005-10-03                          FTD's Received (Table IV)                   2515
12      2005-10-03  Individual Income and Employment Taxes, Not Wi...                    353
21      2005-10-03                          FTD's Received (Table IV)                  15708
...            ...                                                ...                    ...
371473  2024-04-04                       Taxes - Miscellaneous Excise                     17
371474  2024-04-04           Taxes - Non Withheld Ind/SECA Electronic                   1905
371475  2024-04-04                Taxes - Non Withheld Ind/SECA Other                   1092
371476  2024-04-04                        Taxes - Railroad Retirement                      1
371477  2024-04-04                   Taxes - Withheld Individual/FICA                   2871

Question

I'd like to plot this data as a stacked bar chart.

  • x-axis should be 'record_date'.
  • y-axis should be the 'transaction_today_amt'.
  • The 'transaction_catg' values should be used for the stacked items.

I'm open to any plotting library. I.e. matplotlib, bokeh, plotly, etc.

What's a good way to implement this?
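
A minimal sketch with pandas/matplotlib, assuming tmp as built above: pivot so there is one column per transaction_catg, then let pandas draw the stacked bars. With roughly two decades of daily rows the x-axis gets dense, so resampling (e.g. to monthly sums) before plotting may be worth considering.

import matplotlib.pyplot as plt

pivoted = (tmp.pivot_table(index='record_date',
                           columns='transaction_catg',
                           values='transaction_today_amt',
                           aggfunc='sum')
              .fillna(0))

ax = pivoted.plot(kind='bar', stacked=True, figsize=(14, 6), width=1.0)
ax.set_xlabel('record_date')
ax.set_ylabel('transaction_today_amt')
plt.tight_layout()
plt.show()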

How to use pandas rolling_sum with sliding windows

I would like to calculate the sum (or another calculation) with sliding windows. For example, I would like to calculate the sum over the last 10 data points from the current position, for the rows where A is True. Is there a way to do this? With the code below it didn't return the values that I expect.

I put the expected value and the calculation on the side.

Thank you

In [63]: dt['As'] = pd.rolling_sum( dt.Val[ dt.A == True ], window=10, min_periods=1)

In [64]: dt
Out[64]:
    Val     A     B  As
0     1   NaN   NaN NaN
1     1   NaN   NaN NaN
2     1   NaN   NaN NaN
3     1   NaN   NaN NaN
4     6   NaN  True NaN
5     1   NaN   NaN NaN
6     2  True   NaN   1  pos 6 = 2
7     1   NaN   NaN NaN
8     3   NaN   NaN NaN
9     9  True   NaN   2  pos 9 + pos 6 = 11
10    1   NaN   NaN NaN
11    9   NaN   NaN NaN
12    1   NaN   NaN NaN
13    1   NaN  True NaN
14    1   NaN   NaN NaN
15    2  True   NaN   3  pos 15 + pos 9 + pos 6 = 13
16    1   NaN   NaN NaN
17    8   NaN   NaN NaN
18    1   NaN   NaN NaN
19    5  True   NaN   4  pos 19 + pos 15 = 7
20    1   NaN   NaN NaN
21    1   NaN   NaN NaN
22    2   NaN   NaN NaN
23    1   NaN   NaN NaN
24    7   NaN  True NaN
25    1   NaN   NaN NaN
26    1   NaN   NaN NaN
27    1   NaN   NaN NaN
28    3  True   NaN   5 pos 28 + pos 19 = 8

This almost does it

import numpy as np
import pandas as pd
dt = pd.read_csv('test2.csv')

dt['AVal'] = dt.Val[dt.A == True]
dt['ASum'] = pd.rolling_sum( dt.AVal, window=10, min_periods=1)
dt['ACnt'] = pd.rolling_count( dt.AVal, window=10)

In [4]: dt
Out[4]:
    Val     A     B  AVal  ASum  ACnt
0     1   NaN   NaN   NaN   NaN     0
1     1   NaN   NaN   NaN   NaN     0
2     1   NaN   NaN   NaN   NaN     0
3     1   NaN   NaN   NaN   NaN     0
4     6   NaN  True   NaN   NaN     0
5     1   NaN   NaN   NaN   NaN     0
6     2  True   NaN     2     2     1
7     1   NaN   NaN   NaN     2     1
8     3   NaN   NaN   NaN     2     1
9     9  True   NaN     9    11     2
10    1   NaN   NaN   NaN    11     2
11    9   NaN   NaN   NaN    11     2
12    1   NaN   NaN   NaN    11     2
13    1   NaN  True   NaN    11     2
14    1   NaN   NaN   NaN    11     2
15    2  True   NaN     2    13     3
16    1   NaN   NaN   NaN    11     2
17    8   NaN   NaN   NaN    11     2
18    1   NaN   NaN   NaN    11     2
19    5  True   NaN     5     7     2
20    1   NaN   NaN   NaN     7     2
21    1   NaN   NaN   NaN     7     2
22    2   NaN   NaN   NaN     7     2
23    1   NaN   NaN   NaN     7     2
24    7   NaN  True   NaN     7     2
25    1   NaN   NaN   NaN     5     1
26    1   NaN   NaN   NaN     5     1
27    1   NaN   NaN   NaN     5     1
28    3  True   NaN     3     8     2

but I need NaN for all the values in ASum and ACnt where A is NaN. Is this the way to do it?
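
A minimal sketch with the current rolling API (pd.rolling_sum and pd.rolling_count were removed in later pandas releases), which also blanks out the rows where A is NaN:

dt['AVal'] = dt['Val'].where(dt['A'] == True)
dt['ASum'] = dt['AVal'].rolling(window=10, min_periods=1).sum().where(dt['A'] == True)
dt['ACnt'] = dt['AVal'].rolling(window=10, min_periods=1).count().where(dt['A'] == True)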

pandas scatter plot of multiple data sets with different x axis coordinates

I have data organized like this:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
df = pd.DataFrame({(1,'x'):[0,1,2],(1,'y'):[10,11,12],(2,'x'):[1,2,3],(2,'y'):[11.5,11.8,13.2]})
df

In words: I have data sets for several items, and each data set consists of a number of x/y pairs, but each data set has its own, different set of x values.

Now I want to plot all of these data on the same plot, like this: (sorry, uploading pictures does not work tonight, but all the data sets should plot on the same x axis).

I can do it easily with a loop, like this:

fig1,ax = plt.subplots()
for item in range(1,3):
    df.xs(item,axis=1,level=0).plot(ax=ax,kind='line',x='x',y='y',style='o-',label=str(item))

But I wonder if there's a way to get the same plot without using a loop.
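
A minimal sketch that avoids the explicit loop by reshaping to long form and handing the item grouping to seaborn (a library the question doesn't mention, used here only because one call draws every item):

import seaborn as sns
import matplotlib.pyplot as plt

tidy = (df.stack(level=0)             # rows: (observation, item); columns: x, y
          .rename_axis(['obs', 'item'])
          .reset_index())

sns.lineplot(data=tidy, x='x', y='y', hue='item', marker='o')
plt.show()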

โŒ
โŒ