From vnuorval@tcs.hut.fi Fri Aug 1 04:15:27 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 04:15:42 -0700 (PDT) Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71BFPFl004394 for ; Fri, 1 Aug 2003 04:15:26 -0700 Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id 61FB38001D1; Fri, 1 Aug 2003 14:15:23 +0300 (EEST) Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h71BFN5L016559; Fri, 1 Aug 2003 14:15:23 +0300 Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h71BFLaU016555; Fri, 1 Aug 2003 14:15:22 +0300 Date: Fri, 1 Aug 2003 14:15:21 +0300 (EEST) From: Ville Nuorvala To: yoshfuji@linux-ipv6.org, Cc: netdev@oss.sgi.com Subject: [PATCH] IPV6: Incorrect hoplimit in ip6_push_pending_frames() In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4419 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vnuorval@tcs.hut.fi Precedence: bulk X-list: netdev Hi, I noticed the hop limit passed to ip6_append_data() isn't used by ip6_push_pending_frames(), which might lead to unexpected behavior with multicast and (ipv6-in-ipv6) tunneled packets. This patch (against Linux 2.6.0-test2 and cset 1.1595) fixes the problem. Thanks, Ville diff -Nur linux-2.5.OLD/include/linux/ipv6.h linux-2.5/include/linux/ipv6.h --- linux-2.5.OLD/include/linux/ipv6.h Thu Jul 31 18:07:13 2003 +++ linux-2.5/include/linux/ipv6.h Wed Jul 30 15:53:12 2003 @@ -189,6 +189,7 @@ struct ipv6_txoptions *opt; struct rt6_info *rt; struct flowi *fl; + int hop_limit; } cork; }; diff -Nur linux-2.5.OLD/net/ipv6/ip6_output.c linux-2.5/net/ipv6/ip6_output.c --- linux-2.5.OLD/net/ipv6/ip6_output.c Thu Jul 31 18:07:30 2003 +++ linux-2.5/net/ipv6/ip6_output.c Wed Jul 30 22:11:51 2003 @@ -1243,6 +1243,7 @@ dst_hold(&rt->u.dst); np->cork.rt = rt; np->cork.fl = fl; + np->cork.hop_limit = hlimit; inet->cork.fragsize = mtu = dst_pmtu(&rt->u.dst); inet->cork.length = 0; inet->sndmsg_page = NULL; @@ -1465,7 +1466,7 @@ hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); else hdr->payload_len = 0; - hdr->hop_limit = np->hop_limit; + hdr->hop_limit = np->cork.hop_limit; hdr->nexthdr = proto; ipv6_addr_copy(&hdr->saddr, &fl->fl6_src); ipv6_addr_copy(&hdr->daddr, final_dst); -- Ville Nuorvala Research Assistant, Institute of Digital Communications, Helsinki University of Technology email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257 From chas@locutus.cmf.nrl.navy.mil Fri Aug 1 07:02:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 07:02:26 -0700 (PDT) Received: from ginger.cmf.nrl.navy.mil (ginger.cmf.nrl.navy.mil [134.207.10.161]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71E2CFn021118 for ; Fri, 1 Aug 2003 07:02:13 -0700 Received: from locutus.cmf.nrl.navy.mil (locutus.cmf.nrl.navy.mil [134.207.10.66]) by ginger.cmf.nrl.navy.mil (8.12.7/8.12.7) with ESMTP id h6THHosG027846; Tue, 29 Jul 2003 13:17:51 -0400 (EDT) Message-Id: <200307291717.h6THHosG027846@ginger.cmf.nrl.navy.mil> To: Mitchell Blank Jr cc: davem@redhat.com, netdev@oss.sgi.com Reply-To: chas3@users.sourceforge.net Subject: Re: [atmdrvr zatm] Remove obsolete EXACT_TS support In-reply-to: Your message of "Mon, 28 Jul 2003 00:13:23 PDT." <20030728071323.GT32831@gaz.sfgoth.com> Date: Tue, 29 Jul 2003 13:15:09 -0400 From: chas williams X-Spam-Score: () hits=-2.9 X-Virus-Scanned: NAI Completed X-Scanned-By: MIMEDefang 2.30 (www . roaringpenguin . com / mimedefang) X-archive-position: 4420 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: chas@cmf.nrl.navy.mil Precedence: bulk X-list: netdev dave, please apply the following patch (hopefully one will arrive shortly that removes cli() et al from zatm as well): In message <20030728071323.GT32831@gaz.sfgoth.com>,Mitchell Blank Jr writes: >Chas - here's another 2.6 atm driver patch. Please push it upstream. > >This removes the obsolete "exact timestamp" support from the zatm driver. >My understanding is that it was part of a research thing Werner did 8 or >so years ago. It has no purpose for any production use. I think 2.6 is >its time to die. > >Besides, these days we use do_gettimeofday() instead of xtime so we should >have a reasonably accurate timestamp anyways. > >The only program that uses the ZATM_GETTHIST ioctl is the src/debug/znth.c >from the userland distribution. This isn't even compiled as part of the >make process so I don't feel any guilt about breaking it. It should >probably also just go away. > >I don't have the hardware (and really doubt anyone else does either, but >that's another matter entirely) but it still compiles and insmod's. > >Patch is versus 2.6.0-test2. # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1596 -> 1.1597 # drivers/atm/zatm.c 1.12 -> 1.13 # drivers/atm/Kconfig 1.5 -> 1.6 # drivers/atm/zatm.h 1.1 -> 1.2 # include/linux/atm_zatm.h 1.1 -> 1.2 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/07/28 chas@relax.cmf.nrl.navy.mil 1.1597 # remove EXACT_TS remove from zatm (untested) # -------------------------------------------- # diff -Nru a/drivers/atm/Kconfig b/drivers/atm/Kconfig --- a/drivers/atm/Kconfig Tue Jul 29 13:15:41 2003 +++ b/drivers/atm/Kconfig Tue Jul 29 13:15:41 2003 @@ -164,18 +164,6 @@ Note that extended debugging may create certain race conditions itself. Enable this ONLY if you suspect problems with the driver. -config ATM_ZATM_EXACT_TS - bool "Enable usec resolution timestamps" - depends on ATM_ZATM && X86 - help - The uPD98401 SAR chip supports a high-resolution timer (approx. 30 - MHz) that is used for very accurate reception timestamps. Because - that timer overflows after 140 seconds, and also to avoid timer - drift, time measurements need to be periodically synchronized with - the normal system time. Enabling this feature will add some general - overhead for timer synchronization and also per-packet overhead for - time conversion. - # bool 'Rolfs TI TNETA1570' CONFIG_ATM_TNETA1570 y # if [ "$CONFIG_ATM_TNETA1570" = "y" ]; then # bool ' Enable extended debugging' CONFIG_ATM_TNETA1570_DEBUG n diff -Nru a/drivers/atm/zatm.c b/drivers/atm/zatm.c --- a/drivers/atm/zatm.c Tue Jul 29 13:15:41 2003 +++ b/drivers/atm/zatm.c Tue Jul 29 13:15:41 2003 @@ -52,13 +52,6 @@ #define DPRINTK(format,args...) #endif -#ifndef __i386__ -#ifdef CONFIG_ATM_ZATM_EXACT_TS -#warning Precise timestamping only available on i386 platform -#undef CONFIG_ATM_ZATM_EXACT_TS -#endif -#endif - #ifndef CONFIG_ATM_ZATM_DEBUG @@ -347,150 +340,6 @@ restore_flags(flags); } - -/*----------------------- high-precision timestamps -------------------------*/ - - -#ifdef CONFIG_ATM_ZATM_EXACT_TS - -static struct timer_list sync_timer; - - -/* - * Note: the exact time is not normalized, i.e. tv_usec can be > 1000000. - * This must be handled by higher layers. - */ - -static inline struct timeval exact_time(struct zatm_dev *zatm_dev,u32 ticks) -{ - struct timeval tmp; - - tmp = zatm_dev->last_time; - tmp.tv_usec += ((s64) (ticks-zatm_dev->last_clk)* - (s64) zatm_dev->factor) >> TIMER_SHIFT; - return tmp; -} - - -static void zatm_clock_sync(unsigned long dummy) -{ - struct atm_dev *atm_dev; - struct zatm_dev *zatm_dev; - - for (atm_dev = zatm_boards; atm_dev; atm_dev = zatm_dev->more) { - unsigned long flags,interval; - int diff; - struct timeval now,expected; - u32 ticks; - - zatm_dev = ZATM_DEV(atm_dev); - save_flags(flags); - cli(); - ticks = zpeekl(zatm_dev,uPD98401_TSR); - do_gettimeofday(&now); - restore_flags(flags); - expected = exact_time(zatm_dev,ticks); - diff = 1000000*(expected.tv_sec-now.tv_sec)+ - (expected.tv_usec-now.tv_usec); - zatm_dev->timer_history[zatm_dev->th_curr].real = now; - zatm_dev->timer_history[zatm_dev->th_curr].expected = expected; - zatm_dev->th_curr = (zatm_dev->th_curr+1) & - (ZATM_TIMER_HISTORY_SIZE-1); - interval = 1000000*(now.tv_sec-zatm_dev->last_real_time.tv_sec) - +(now.tv_usec-zatm_dev->last_real_time.tv_usec); - if (diff >= -ADJ_REP_THRES && diff <= ADJ_REP_THRES) - zatm_dev->timer_diffs = 0; - else -#ifndef AGGRESSIVE_DEBUGGING - if (++zatm_dev->timer_diffs >= ADJ_MSG_THRES) -#endif - { - zatm_dev->timer_diffs = 0; - printk(KERN_INFO DEV_LABEL ": TSR update after %ld us:" - " calculation differed by %d us\n",interval,diff); -#ifdef AGGRESSIVE_DEBUGGING - printk(KERN_DEBUG " %d.%08d -> %d.%08d (%lu)\n", - zatm_dev->last_real_time.tv_sec, - zatm_dev->last_real_time.tv_usec, - now.tv_sec,now.tv_usec,interval); - printk(KERN_DEBUG " %u -> %u (%d)\n", - zatm_dev->last_clk,ticks,ticks-zatm_dev->last_clk); - printk(KERN_DEBUG " factor %u\n",zatm_dev->factor); -#endif - } - if (diff < -ADJ_IGN_THRES || diff > ADJ_IGN_THRES) { - /* filter out any major changes (e.g. time zone setup and - such) */ - zatm_dev->last_time = now; - zatm_dev->factor = - (1000 << TIMER_SHIFT)/(zatm_dev->khz+1); - } - else { - zatm_dev->last_time = expected; - /* - * Is the accuracy of udelay really only about 1:300 on - * a 90 MHz Pentium ? Well, the following line avoids - * the problem, but ... - * - * What it does is simply: - * - * zatm_dev->factor = (interval << TIMER_SHIFT)/ - * (ticks-zatm_dev->last_clk); - */ -#define S(x) #x /* "stringification" ... */ -#define SX(x) S(x) - asm("movl %2,%%ebx\n\t" - "subl %3,%%ebx\n\t" - "xorl %%edx,%%edx\n\t" - "shldl $" SX(TIMER_SHIFT) ",%1,%%edx\n\t" - "shl $" SX(TIMER_SHIFT) ",%1\n\t" - "divl %%ebx\n\t" - : "=a" (zatm_dev->factor) - : "0" (interval-diff),"g" (ticks), - "g" (zatm_dev->last_clk) - : "ebx","edx","cc"); -#undef S -#undef SX -#ifdef AGGRESSIVE_DEBUGGING - printk(KERN_DEBUG " (%ld << %d)/(%u-%u) = %u\n", - interval,TIMER_SHIFT,ticks,zatm_dev->last_clk, - zatm_dev->factor); -#endif - } - zatm_dev->last_real_time = now; - zatm_dev->last_clk = ticks; - } - mod_timer(&sync_timer,sync_timer.expires+POLL_INTERVAL*HZ); -} - - -static void __init zatm_clock_init(struct zatm_dev *zatm_dev) -{ - static int start_timer = 1; - unsigned long flags; - - zatm_dev->factor = (1000 << TIMER_SHIFT)/(zatm_dev->khz+1); - zatm_dev->timer_diffs = 0; - memset(zatm_dev->timer_history,0,sizeof(zatm_dev->timer_history)); - zatm_dev->th_curr = 0; - save_flags(flags); - cli(); - do_gettimeofday(&zatm_dev->last_time); - zatm_dev->last_clk = zpeekl(zatm_dev,uPD98401_TSR); - if (start_timer) { - start_timer = 0; - init_timer(&sync_timer); - sync_timer.expires = jiffies+POLL_INTERVAL*HZ; - sync_timer.function = zatm_clock_sync; - add_timer(&sync_timer); - } - restore_flags(flags); -} - - -#endif - - /*----------------------------------- RX ------------------------------------*/ @@ -581,11 +430,7 @@ EVENT("error code 0x%x/0x%x\n",(here[3] & uPD98401_AAL5_ES) >> uPD98401_AAL5_ES_SHIFT,error); skb = ((struct rx_buffer_head *) bus_to_virt(here[2]))->skb; -#ifdef CONFIG_ATM_ZATM_EXACT_TS - skb->stamp = exact_time(zatm_dev,here[1]); -#else do_gettimeofday(&skb->stamp); -#endif #if 0 printk("[-3..0] 0x%08lx 0x%08lx 0x%08lx 0x%08lx\n",((unsigned *) skb->data)[-3], ((unsigned *) skb->data)[-2],((unsigned *) skb->data)[-1], @@ -1455,9 +1300,6 @@ "MHz\n",dev->number, (zin(VER) & uPD98401_MAJOR) >> uPD98401_MAJOR_SHIFT, zin(VER) & uPD98401_MINOR,zatm_dev->khz/1000,zatm_dev->khz % 1000); -#ifdef CONFIG_ATM_ZATM_EXACT_TS - zatm_clock_init(zatm_dev); -#endif return uPD98402_init(dev); } @@ -1699,22 +1541,6 @@ restore_flags(flags); return 0; } -#ifdef CONFIG_ATM_ZATM_EXACT_TS - case ZATM_GETTHIST: - { - int i; - struct zatm_t_hist hs[ZATM_TIMER_HISTORY_SIZE]; - save_flags(flags); - cli(); - for (i = 0; i < ZATM_TIMER_HISTORY_SIZE; i++) - hs[i] = zatm_dev->timer_history[ - (zatm_dev->th_curr+i) & - (ZATM_TIMER_HISTORY_SIZE-1)]; - restore_flags(flags); - return copy_to_user((struct zatm_t_hist *) arg, - hs, sizeof(hs)) ? -EFAULT : 0; - } -#endif default: if (!dev->phy->ioctl) return -ENOIOCTLCMD; return dev->phy->ioctl(dev,cmd,arg); diff -Nru a/drivers/atm/zatm.h b/drivers/atm/zatm.h --- a/drivers/atm/zatm.h Tue Jul 29 13:15:41 2003 +++ b/drivers/atm/zatm.h Tue Jul 29 13:15:41 2003 @@ -40,31 +40,6 @@ #define MBX_TX_0 2 #define MBX_TX_1 3 - -/* - * mkdep doesn't spot this dependency, but that's okay, because zatm.c uses - * CONFIG_ATM_ZATM_EXACT_TS too. - */ - -#ifdef CONFIG_ATM_ZATM_EXACT_TS -#define POLL_INTERVAL 60 /* TSR poll interval in seconds; must be <= - (2^31-1)/clock */ -#define TIMER_SHIFT 20 /* scale factor for fixed-point arithmetic; - 1 << TIMER_SHIFT must be - (1) <= (2^64-1)/(POLL_INTERVAL*clock), - (2) >> clock/10^6, and - (3) <= (2^32-1)/1000 */ -#define ADJ_IGN_THRES 1000000 /* don't adjust if we're off by more than that - many usecs - this filters clock corrections, - time zone changes, etc. */ -#define ADJ_REP_THRES 20000 /* report only differences of more than that - many usecs (don't mention single lost timer - ticks; 10 msec is only 0.03% anyway) */ -#define ADJ_MSG_THRES 5 /* issue complaints only if getting that many - significant timer differences in a row */ -#endif - - struct zatm_vcc { /*-------------------------------- RX part */ int rx_chan; /* RX channel, 0 if none */ @@ -103,17 +78,6 @@ u32 pool_base; /* Free buffer pool dsc (word addr) */ /*-------------------------------- ZATM links */ struct atm_dev *more; /* other ZATM devices */ -#ifdef CONFIG_ATM_ZATM_EXACT_TS - /*-------------------------------- timestamp calculation */ - u32 last_clk; /* results of last poll: clock, */ - struct timeval last_time; /* virtual time and */ - struct timeval last_real_time; /* real time */ - u32 factor; /* multiplication factor */ - int timer_diffs; /* number of significant deviations */ - struct zatm_t_hist timer_history[ZATM_TIMER_HISTORY_SIZE]; - /* record of timer synchronizations */ - int th_curr; /* current position */ -#endif /*-------------------------------- general information */ int mem; /* RAM on board (in bytes) */ int khz; /* timer clock */ diff -Nru a/include/linux/atm_zatm.h b/include/linux/atm_zatm.h --- a/include/linux/atm_zatm.h Tue Jul 29 13:15:41 2003 +++ b/include/linux/atm_zatm.h Tue Jul 29 13:15:41 2003 @@ -21,9 +21,6 @@ /* get statistics and zero */ #define ZATM_SETPOOL _IOW('a',ATMIOC_SARPRV+3,struct atmif_sioc) /* set pool parameters */ -#define ZATM_GETTHIST _IOW('a',ATMIOC_SARPRV+4,struct atmif_sioc) - /* get a history of timer - differences */ struct zatm_pool_info { int ref_count; /* free buffer pool usage counters */ From chas@locutus.cmf.nrl.navy.mil Fri Aug 1 07:02:13 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 07:02:26 -0700 (PDT) Received: from ginger.cmf.nrl.navy.mil (ginger.cmf.nrl.navy.mil [134.207.10.161]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71E2CFl021118 for ; Fri, 1 Aug 2003 07:02:12 -0700 Received: from locutus.cmf.nrl.navy.mil (locutus.cmf.nrl.navy.mil [134.207.10.66]) by ginger.cmf.nrl.navy.mil (8.12.7/8.12.7) with ESMTP id h6VEQgsG023826; Thu, 31 Jul 2003 10:26:42 -0400 (EDT) Message-Id: <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil> To: Mitchell Blank Jr cc: davem@redhat.com, netdev@oss.sgi.com Reply-To: chas3@users.sourceforge.net Subject: Re: [Linux-ATM-General] Re: [atmdrvr zatm] Remove obsolete EXACT_TS support In-reply-to: Your message of "Wed, 30 Jul 2003 15:57:42 PDT." <20030730225741.GA57991@gaz.sfgoth.com> Date: Thu, 31 Jul 2003 10:23:58 -0400 From: chas williams X-Spam-Score: () hits=-0.3 X-Virus-Scanned: NAI Completed X-Scanned-By: MIMEDefang 2.30 (www . roaringpenguin . com / mimedefang) X-archive-position: 4420 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: chas@cmf.nrl.navy.mil Precedence: bulk X-list: netdev please apply to 2.6. zatm will now compile on smp. it might actually work if someone had some hardware to test it. [atm]: [zatm] convert cli() to spinlock # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1597 -> 1.1598 # drivers/atm/zatm.c 1.13 -> 1.14 # drivers/atm/uPD98402.c 1.4 -> 1.5 # drivers/atm/zatm.h 1.2 -> 1.3 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/07/31 chas@relax.cmf.nrl.navy.mil 1.1598 # [zatm] convert cli() to spinlock # -------------------------------------------- # diff -Nru a/drivers/atm/uPD98402.c b/drivers/atm/uPD98402.c --- a/drivers/atm/uPD98402.c Thu Jul 31 10:25:25 2003 +++ b/drivers/atm/uPD98402.c Thu Jul 31 10:25:25 2003 @@ -27,6 +27,7 @@ struct k_sonet_stats sonet_stats;/* link diagnostics */ unsigned char framing; /* SONET/SDH framing */ int loop_mode; /* loopback mode */ + spinlock_t lock; }; @@ -71,14 +72,13 @@ default: return -EINVAL; } - save_flags(flags); - cli(); + spin_lock_irqsave(&PRIV(dev)->lock, flags); PUT(set[0],C11T); PUT(set[1],C12T); PUT(set[2],C13T); PUT((GET(MDR) & ~uPD98402_MDR_SS_MASK) | (set[3] << uPD98402_MDR_SS_SHIFT),MDR); - restore_flags(flags); + spin_unlock_irqrestore(&PRIV(dev)->lock, flags); return 0; } @@ -88,12 +88,11 @@ unsigned long flags; unsigned char s[3]; - save_flags(flags); - cli(); + spin_lock_irqsave(&PRIV(dev)->lock, flags); s[0] = GET(C11R); s[1] = GET(C12R); s[2] = GET(C13R); - restore_flags(flags); + spin_unlock_irqrestore(&PRIV(dev)->lock, flags); return (put_user(s[0], arg) || put_user(s[1], arg+1) || put_user(s[2], arg+2) || put_user(0xff, arg+3) || put_user(0xff, arg+4) || put_user(0xff, arg+5)) ? -EFAULT : 0; @@ -214,6 +213,7 @@ DPRINTK("phy_start\n"); if (!(PRIV(dev) = kmalloc(sizeof(struct uPD98402_priv),GFP_KERNEL))) return -ENOMEM; + spin_lock_init(&PRIV(dev)->lock); memset(&PRIV(dev)->sonet_stats,0,sizeof(struct k_sonet_stats)); (void) GET(PCR); /* clear performance events */ PUT(uPD98402_PFM_FJ,PCMR); /* ignore frequency adj */ diff -Nru a/drivers/atm/zatm.c b/drivers/atm/zatm.c --- a/drivers/atm/zatm.c Thu Jul 31 10:25:25 2003 +++ b/drivers/atm/zatm.c Thu Jul 31 10:25:25 2003 @@ -195,11 +195,10 @@ sizeof(struct rx_buffer_head); } size += align; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); free = zpeekl(zatm_dev,zatm_dev->pool_base+2*pool) & uPD98401_RXFP_REMAIN; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); if (free >= zatm_dev->pool_info[pool].low_water) return; EVENT("starting ... POOL: 0x%x, 0x%x\n", zpeekl(zatm_dev,zatm_dev->pool_base+2*pool), @@ -228,22 +227,22 @@ head->skb = skb; EVENT("enq skb 0x%08lx/0x%08lx\n",(unsigned long) skb, (unsigned long) head); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); if (zatm_dev->last_free[pool]) ((struct rx_buffer_head *) (zatm_dev->last_free[pool]-> data))[-1].link = virt_to_bus(head); zatm_dev->last_free[pool] = skb; skb_queue_tail(&zatm_dev->pool[pool],skb); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); free++; } if (first) { - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zwait; zout(virt_to_bus(first),CER); zout(uPD98401_ADD_BAT | (pool << uPD98401_POOL_SHIFT) | count, CMR); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); EVENT ("POOL: 0x%x, 0x%x\n", zpeekl(zatm_dev,zatm_dev->pool_base+2*pool), zpeekl(zatm_dev,zatm_dev->pool_base+2*pool+1)); @@ -286,8 +285,7 @@ size = pool-ZATM_AAL5_POOL_BASE; if (size < 0) size = 0; /* 64B... */ else if (size > 10) size = 10; /* ... 64kB */ - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zpokel(zatm_dev,((zatm_dev->pool_info[pool].low_water/4) << uPD98401_RXFP_ALERT_SHIFT) | (1 << uPD98401_RXFP_BTSZ_SHIFT) | @@ -295,7 +293,7 @@ zatm_dev->pool_base+pool*2); zpokel(zatm_dev,(unsigned long) dummy,zatm_dev->pool_base+ pool*2+1); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); zatm_dev->last_free[pool] = NULL; refill_pool(dev,pool); } @@ -315,29 +313,29 @@ { struct zatm_pool_info *pool; unsigned long offset,flags; + struct zatm_dev *zatm_dev = ZATM_DEV(vcc->dev); DPRINTK("start 0x%08lx dest 0x%08lx len %d\n",start,dest,len); if (len < PAGE_SIZE) return; - pool = &ZATM_DEV(vcc->dev)->pool_info[ZATM_VCC(vcc)->pool]; + pool = &zatm_dev->pool_info[ZATM_VCC(vcc)->pool]; offset = (dest-start) & (PAGE_SIZE-1); - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); if (!offset || pool->offset == offset) { pool->next_cnt = 0; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return; } if (offset != pool->next_off) { pool->next_off = offset; pool->next_cnt = 0; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return; } if (++pool->next_cnt >= pool->next_thres) { pool->offset = pool->next_off; pool->next_cnt = 0; } - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); } /*----------------------------------- RX ------------------------------------*/ @@ -535,20 +533,19 @@ zatm_vcc->pool = ZATM_AAL0_POOL; } if (zatm_vcc->pool < 0) return -EMSGSIZE; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zwait; zout(uPD98401_OPEN_CHAN,CMR); zwait; DPRINTK("0x%x 0x%x\n",zin(CMR),zin(CER)); chan = (zin(CMR) & uPD98401_CHAN_ADDR) >> uPD98401_CHAN_ADDR_SHIFT; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); DPRINTK("chan is %d\n",chan); if (!chan) return -EAGAIN; use_pool(vcc->dev,zatm_vcc->pool); DPRINTK("pool %d\n",zatm_vcc->pool); /* set up VC descriptor */ - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zpokel(zatm_dev,zatm_vcc->pool << uPD98401_RXVC_POOL_SHIFT, chan*VC_SIZE/4); zpokel(zatm_dev,uPD98401_RXVC_OD | (vcc->qos.aal == ATM_AAL5 ? @@ -556,7 +553,7 @@ zpokel(zatm_dev,0,chan*VC_SIZE/4+2); zatm_vcc->rx_chan = chan; zatm_dev->rx_map[chan] = vcc; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return 0; } @@ -572,14 +569,13 @@ zatm_dev = ZATM_DEV(vcc->dev); zatm_vcc = ZATM_VCC(vcc); if (!zatm_vcc->rx_chan) return 0; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); /* should also handle VPI @@@ */ pos = vcc->vci >> 1; shift = (1-(vcc->vci & 1)) << 4; zpokel(zatm_dev,(zpeekl(zatm_dev,pos) & ~(0xffff << shift)) | ((zatm_vcc->rx_chan | uPD98401_RXLT_ENBL) << shift),pos); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return 0; } @@ -596,9 +592,8 @@ if (!zatm_vcc->rx_chan) return; DPRINTK("close_rx\n"); /* disable receiver */ - save_flags(flags); if (vcc->vpi != ATM_VPI_UNSPEC && vcc->vci != ATM_VCI_UNSPEC) { - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); pos = vcc->vci >> 1; shift = (1-(vcc->vci & 1)) << 4; zpokel(zatm_dev,zpeekl(zatm_dev,pos) & ~(0xffff << shift),pos); @@ -606,9 +601,9 @@ zout(uPD98401_NOP,CMR); zwait; zout(uPD98401_NOP,CMR); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); } - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zwait; zout(uPD98401_DEACT_CHAN | uPD98401_CHAN_RT | (zatm_vcc->rx_chan << uPD98401_CHAN_ADDR_SHIFT),CMR); @@ -620,7 +615,7 @@ if (!(zin(CMR) & uPD98401_CHAN_ADDR)) printk(KERN_CRIT DEV_LABEL "(itf %d): can't close RX channel " "%d\n",vcc->dev->number,zatm_vcc->rx_chan); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); zatm_dev->rx_map[zatm_vcc->rx_chan] = NULL; zatm_vcc->rx_chan = 0; unuse_pool(vcc->dev,zatm_vcc->pool); @@ -673,11 +668,10 @@ zatm_dev = ZATM_DEV(vcc->dev); zatm_vcc = ZATM_VCC(vcc); EVENT("iovcnt=%d\n",skb_shinfo(skb)->nr_frags,0); - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); if (!skb_shinfo(skb)->nr_frags) { if (zatm_vcc->txing == RING_ENTRIES-1) { - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return RING_BUSY; } zatm_vcc->txing++; @@ -732,7 +726,7 @@ zwait; zout(uPD98401_TX_READY | (zatm_vcc->tx_chan << uPD98401_CHAN_ADDR_SHIFT),CMR); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); EVENT("done\n",0,0); return 0; } @@ -866,15 +860,14 @@ if (zatm_dev->tx_bw < *pcr) return -EAGAIN; zatm_dev->tx_bw -= *pcr; } - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); DPRINTK("i = %d, m = %d, PCR = %d\n",i,m,*pcr); zpokel(zatm_dev,(i << uPD98401_IM_I_SHIFT) | m,uPD98401_IM(shaper)); zpokel(zatm_dev,c << uPD98401_PC_C_SHIFT,uPD98401_PC(shaper)); zpokel(zatm_dev,0,uPD98401_X(shaper)); zpokel(zatm_dev,0,uPD98401_Y(shaper)); zpokel(zatm_dev,uPD98401_PS_E,uPD98401_PS(shaper)); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return shaper; } @@ -889,11 +882,10 @@ if (--zatm_dev->ubr_ref_cnt) return; zatm_dev->ubr = -1; } - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zpokel(zatm_dev,zpeekl(zatm_dev,uPD98401_PS(shaper)) & ~uPD98401_PS_E, uPD98401_PS(shaper)); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); zatm_dev->free_shapers |= 1 << shaper; } @@ -912,8 +904,6 @@ chan = zatm_vcc->tx_chan; if (!chan) return; DPRINTK("close_tx\n"); - save_flags(flags); - cli(); while (skb_peek(&zatm_vcc->backlog)) { if (once) { printk("waiting for backlog to drain ...\n"); @@ -932,6 +922,7 @@ DPRINTK("waiting for TX queue to drain ... %p\n",skb); sleep_on(&zatm_vcc->tx_wait); } + spin_lock_irqsave(&zatm_dev->lock, flags); #if 0 zwait; zout(uPD98401_DEACT_CHAN | (chan << uPD98401_CHAN_ADDR_SHIFT),CMR); @@ -942,7 +933,7 @@ if (!(zin(CMR) & uPD98401_CHAN_ADDR)) printk(KERN_CRIT DEV_LABEL "(itf %d): can't close TX channel " "%d\n",vcc->dev->number,chan); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); zatm_vcc->tx_chan = 0; zatm_dev->tx_map[chan] = NULL; if (zatm_vcc->shaper != zatm_dev->ubr) { @@ -967,14 +958,13 @@ zatm_vcc = ZATM_VCC(vcc); zatm_vcc->tx_chan = 0; if (vcc->qos.txtp.traffic_class == ATM_NONE) return 0; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zwait; zout(uPD98401_OPEN_CHAN,CMR); zwait; DPRINTK("0x%x 0x%x\n",zin(CMR),zin(CER)); chan = (zin(CMR) & uPD98401_CHAN_ADDR) >> uPD98401_CHAN_ADDR_SHIFT; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); DPRINTK("chan is %d\n",chan); if (!chan) return -EAGAIN; unlimited = vcc->qos.txtp.traffic_class == ATM_UBR && @@ -1022,15 +1012,14 @@ zatm_dev = ZATM_DEV(vcc->dev); zatm_vcc = ZATM_VCC(vcc); if (!zatm_vcc->tx_chan) return 0; - save_flags(flags); /* set up VC descriptor */ - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zpokel(zatm_dev,0,zatm_vcc->tx_chan*VC_SIZE/4); zpokel(zatm_dev,uPD98401_TXVC_L | (zatm_vcc->shaper << uPD98401_TXVC_SHP_SHIFT) | (vcc->vpi << uPD98401_TXVC_VPI_SHIFT) | vcc->vci,zatm_vcc->tx_chan*VC_SIZE/4+1); zpokel(zatm_dev,0,zatm_vcc->tx_chan*VC_SIZE/4+2); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); zatm_dev->tx_map[zatm_vcc->tx_chan] = vcc; return 0; } @@ -1236,6 +1225,7 @@ DPRINTK(">zatm_init\n"); zatm_dev = ZATM_DEV(dev); + spin_lock_init(&zatm_dev->lock); pci_dev = zatm_dev->pci_dev; zatm_dev->base = pci_resource_start(pci_dev, 0); zatm_dev->irq = pci_dev->irq; @@ -1285,14 +1275,13 @@ do { unsigned long flags; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); t0 = zpeekl(zatm_dev,uPD98401_TSR); udelay(10); t1 = zpeekl(zatm_dev,uPD98401_TSR); udelay(1010); t2 = zpeekl(zatm_dev,uPD98401_TSR); - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); } while (t0 > t1 || t1 > t2); /* loop if wrapping ... */ zatm_dev->khz = t2-2*t1+t0; @@ -1492,14 +1481,13 @@ return -EFAULT; if (pool < 0 || pool > ZATM_LAST_POOL) return -EINVAL; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); info = zatm_dev->pool_info[pool]; if (cmd == ZATM_GETPOOLZ) { zatm_dev->pool_info[pool].rqa_count = 0; zatm_dev->pool_info[pool].rqu_count = 0; } - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return copy_to_user( &((struct zatm_pool_req *) arg)->info, &info,sizeof(info)) ? -EFAULT : 0; @@ -1530,15 +1518,14 @@ if (info.low_water >= info.high_water || info.low_water < 0) return -EINVAL; - save_flags(flags); - cli(); + spin_lock_irqsave(&zatm_dev->lock, flags); zatm_dev->pool_info[pool].low_water = info.low_water; zatm_dev->pool_info[pool].high_water = info.high_water; zatm_dev->pool_info[pool].next_thres = info.next_thres; - restore_flags(flags); + spin_unlock_irqrestore(&zatm_dev->lock, flags); return 0; } default: diff -Nru a/drivers/atm/zatm.h b/drivers/atm/zatm.h --- a/drivers/atm/zatm.h Thu Jul 31 10:25:25 2003 +++ b/drivers/atm/zatm.h Thu Jul 31 10:25:25 2003 @@ -85,6 +85,7 @@ unsigned char irq; /* IRQ */ unsigned int base; /* IO base address */ struct pci_dev *pci_dev; /* PCI stuff */ + spinlock_t lock; }; From willy@www.linux.org.uk Fri Aug 1 08:02:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:02:45 -0700 (PDT) Received: from www.linux.org.uk (IDENT:zD29xXS/6K4bSxwPbxUDl3tjJbe2uGQJ@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71F2XFl030278 for ; Fri, 1 Aug 2003 08:02:35 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19ibQO-0003NL-OB for netdev@oss.sgi.com; Fri, 01 Aug 2003 16:02:32 +0100 Date: Fri, 1 Aug 2003 16:02:32 +0100 From: Matthew Wilcox To: netdev@oss.sgi.com Subject: [PATCH] ethtool_ops rev 4 Message-ID: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-archive-position: 4421 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev At 55k, I doubt you want to see it posted to the list; patch is available from http://ftp.linux.org.uk/pub/linux/willy/patches/ethtool4.diff and here's the diffstat drivers/net/8139too.c | 330 ++++++++-------------- drivers/net/tg3.c | 584 ++++++++++++++++------------------------ include/linux/ethtool.h | 100 ++++++ include/linux/netdevice.h | 5 net/core/Makefile | 4 net/core/dev.c | 16 - net/core/ethtool.c | 671 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 1154 insertions(+), 556 deletions(-) Patch has received light testing on an rtl8139c card: Settings for eth0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Advertised auto-negotiation: Yes Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 32 Transceiver: internal Auto-negotiation: on Supports Wake-on: pumbg Wake-on: d Current message level: 0xffffffff (-1) Link detected: yes but obviously it doesn't support all the ethtool options that some cards do. -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From garzik@gtf.org Fri Aug 1 08:40:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:40:34 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71FeSFl027853 for ; Fri, 1 Aug 2003 08:40:29 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id D82626698; Fri, 1 Aug 2003 11:40:21 -0400 (EDT) Date: Fri, 1 Aug 2003 11:40:21 -0400 From: Jeff Garzik To: Matthew Wilcox Cc: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030801154021.GA7696@gtf.org> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> User-Agent: Mutt/1.3.28i X-archive-position: 4422 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 04:02:32PM +0100, Matthew Wilcox wrote: > and here's the diffstat > > drivers/net/8139too.c | 330 ++++++++-------------- > drivers/net/tg3.c | 584 ++++++++++++++++------------------------ > include/linux/ethtool.h | 100 ++++++ > include/linux/netdevice.h | 5 > net/core/Makefile | 4 > net/core/dev.c | 16 - > net/core/ethtool.c | 671 ++++++++++++++++++++++++++++++++++++++++++++++ > 7 files changed, 1154 insertions(+), 556 deletions(-) Comments: * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar * I still do not see the need to change a simple storage of a constant (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg dump len, eeprom dump len, nic-specific stats len, self-test len). Internal kernel code that needs this information is always a slow path anyway, so just call the ->get_drvinfo hook internally. * I prefer not to add '#include ' to ethtool.h Other than those, looks real good. Jeff From jmorris@intercode.com.au Fri Aug 1 08:51:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:51:20 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:TamECck9nHItRCfvPti5PCOPYy+eE6V7@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71FpEFl029615 for ; Fri, 1 Aug 2003 08:51:16 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h71Fowr27206; Sat, 2 Aug 2003 01:50:58 +1000 Date: Sat, 2 Aug 2003 01:50:57 +1000 (EST) From: James Morris To: Zwane Mwaikambo cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4423 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > You can reproduce this one easily by doing 5-6 ping -f of a system on the > network (not loopback), originally picked up at http://bugme.osdl.org/show_bug.cgi?id=937 Any chance of getting a gdb traceback on this one? :-) - James -- James Morris From garzik@gtf.org Fri Aug 1 09:25:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:25:48 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GPfFl001543 for ; Fri, 1 Aug 2003 09:25:42 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 6EBE76696; Fri, 1 Aug 2003 12:25:36 -0400 (EDT) Date: Fri, 1 Aug 2003 12:25:36 -0400 From: Jeff Garzik To: Matthew Wilcox Cc: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030801162536.GA18574@gtf.org> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> User-Agent: Mutt/1.3.28i X-archive-position: 4424 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > Comments: > > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > DaveM disagreed with that... It's standard netdevice.h practice, and, he didn't disagree w/ my rebuttal. It is needed. > > * I still do not see the need to change a simple storage of a constant > > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg > > dump len, eeprom dump len, nic-specific stats len, self-test len). > > Internal kernel code that needs this information is always a slow path > > anyway, so just call the ->get_drvinfo hook internally. > > slow path, sure, but increased stack usage. it's a tradeoff, and this way > feels more clean to me. Additing a function hook each time you want to retrieve a new integer value? That's feels overly excessive to me. > > * I prefer not to add '#include ' to ethtool.h > > That means that any code which includes ethtool.h has to include types.h > first (either implicitly or explicitly). The rule so far has been that > header files should call out their dependencies explictly with an include > of the appropriate file. So why *don't* you want it? Because I copy it to userspace :) Jeff From zwane@arm.linux.org.uk Fri Aug 1 09:26:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:26:30 -0700 (PDT) Received: from hemi.commfireservices.com ([66.212.224.118]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GQMFl002069 for ; Fri, 1 Aug 2003 09:26:23 -0700 Received: from montezuma.mastecende.com (cuda.commfireservices.com [24.202.53.9]) by hemi.commfireservices.com (Postfix) with ESMTP id 0AB23BC54; Fri, 1 Aug 2003 12:15:16 -0400 (EDT) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by montezuma.mastecende.com (8.12.8/8.12.8) with ESMTP id h71GEftE031939; Fri, 1 Aug 2003 12:14:42 -0400 Date: Fri, 1 Aug 2003 12:14:41 -0400 (EDT) From: Zwane Mwaikambo X-X-Sender: zwane@montezuma.mastecende.com To: James Morris Cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4425 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: zwane@arm.linux.org.uk Precedence: bulk X-list: netdev On Sat, 2 Aug 2003, James Morris wrote: > On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > > > You can reproduce this one easily by doing 5-6 ping -f of a system on the > > network (not loopback), originally picked up at http://bugme.osdl.org/show_bug.cgi?id=937 > > Any chance of getting a gdb traceback on this one? :-) Here is a new oops with the corresponding code. 2.6.0-test2-mm2 (gdb) list *raw_rcv_skb+0x1b5 0xc04e2235 is in raw_rcv_skb (sock.h:942). 937 938 skb->dev = NULL; 939 skb_set_owner_r(skb, sk); 940 skb_queue_tail(&sk->sk_receive_queue, skb); 941 if (!sock_flag(sk, SOCK_DEAD)) 942 sk->sk_data_ready(sk, skb->len); 943 out: 944 return err; 945 } Unable to handle kernel paging request at virtual address c3148068 printing eip: c04e2235 *pde = 0000d067 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 EIP is at raw_rcv_skb+0x1b5/0x270 eax: 00000000 ebx: 00000104 ecx: 00000104 edx: 00000001 esi: c7fae004 edi: 00000000 ebp: c3148004 esp: cbf4fecc ds: 007b es: 007b ss: 0068 Process ksoftirqd/0 (pid: 3, threadinfo=cbf4e000 task=cbf81000) Stack: c7fae06c cbf4e000 00000206 00000000 c3148000 0000005a c3148004 c96e7024 c7fae004 cab51004 c04e237c c7fae004 c3148004 00000020 c7fae004 c96e7024 c04e1ead c7fae004 c3148004 00000001 ca214004 cab51004 0a00a8c0 c04bd389 Call Trace: [] raw_rcv+0x8c/0xe0 [] raw_v4_input+0xbd/0x150 [] ip_local_deliver+0xc9/0x270 [] ip_rcv+0x37c/0x4e0 [] netif_receive_skb+0x153/0x1d0 [] process_backlog+0x87/0x160 [] net_rx_action+0x84/0x160 [] do_softirq+0xd3/0xe0 [] ksoftirqd+0xbc/0x100 [] ksoftirqd+0x0/0x100 [] kernel_thread_helper+0x5/0x10 Code: 43 86 56 68 ff 74 24 08 9d 8b 54 24 04 8b 5a 14 4b 89 5a 14 8b 42 08 83 e0 08 <0>Kernel panic: Fatal exception in interrupt In interrupt handler - not syncing -- function.linuxpower.ca From zwane@arm.linux.org.uk Fri Aug 1 09:29:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:29:57 -0700 (PDT) Received: from hemi.commfireservices.com ([66.212.224.118]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GTrFl002789 for ; Fri, 1 Aug 2003 09:29:53 -0700 Received: from montezuma.mastecende.com (cuda.commfireservices.com [24.202.53.9]) by hemi.commfireservices.com (Postfix) with ESMTP id 38570BC56; Fri, 1 Aug 2003 12:18:47 -0400 (EDT) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by montezuma.mastecende.com (8.12.8/8.12.8) with ESMTP id h71GIDtE031957; Fri, 1 Aug 2003 12:18:13 -0400 Date: Fri, 1 Aug 2003 12:18:13 -0400 (EDT) From: Zwane Mwaikambo X-X-Sender: zwane@montezuma.mastecende.com To: James Morris Cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4426 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: zwane@arm.linux.org.uk Precedence: bulk X-list: netdev On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > Here is a new oops with the corresponding code. 2.6.0-test2-mm2 > > (gdb) list *raw_rcv_skb+0x1b5 > 0xc04e2235 is in raw_rcv_skb (sock.h:942). > 937 > 938 skb->dev = NULL; > 939 skb_set_owner_r(skb, sk); > 940 skb_queue_tail(&sk->sk_receive_queue, skb); > 941 if (!sock_flag(sk, SOCK_DEAD)) > 942 sk->sk_data_ready(sk, skb->len); > 943 out: > 944 return err; > 945 } seems to be the same bug as the previous one i posted. From nebuchadnezzar@nerim.net Fri Aug 1 10:53:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 10:53:43 -0700 (PDT) Received: from cerbere (nebuchadnezzar.net1.nerim.net [213.41.153.130]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71HrVFl009240 for ; Fri, 1 Aug 2003 10:53:35 -0700 Received: from [2001:7a8:5982:1:209:5bff:fe1c:f0b8] (helo=zion.matrix) by cerbere with esmtp (Exim 4.20) id 19ie5k-0000Lz-JF for netdev@oss.sgi.com; Fri, 01 Aug 2003 19:53:24 +0200 Received: from localhost ([::1] helo=zion.nerim.net) by zion.matrix with esmtp (Exim 4.20) id 19ie5k-0007tc-0j for netdev@oss.sgi.com; Fri, 01 Aug 2003 19:53:24 +0200 To: netdev@oss.sgi.com Subject: [PATCH] 2.4.x USAGI mipv6_ha_ipsec From: "Daniel 'NebuchadnezzaR' Dehennin" Organisation: CaLviX Date: Fri, 01 Aug 2003 19:53:23 +0200 Message-ID: <87n0etgt7w.fsf@zion.matrix> User-Agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4427 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nebuchadnezzar@nerim.net Precedence: bulk X-list: netdev Hello, My 2.4.21 with USAGI 20030801 don't build /net/ipv6/mobile_ip6/mipv6_ha_ipsec.c : mipv6_ha_ipsec.c: In function `mipv6_change_sa_index': mipv6_ha_ipsec.c:118: warning: implicit declaration of function `in6_ntop' mipv6_ha_ipsec.c:118: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:119: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:126: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:127: warning: format argument is not a pointer (arg 4) [...] I search for the definition of in6_ntop, it in include/linux/inet.h so I make that patch. Thanks. --- linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c.orig 2003-08-01 19:37:22.000000000 +0200 +++ linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c 2003-08-01 19:03:42.000000000 +0200 @@ -62,6 +62,7 @@ #include #include #include +#include #include #include #include -- Daniel 'NebuchadnezzaR' Dehennin From nebuchadnezzar@nerim.net Fri Aug 1 11:09:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 11:09:36 -0700 (PDT) Received: from cerbere (nebuchadnezzar.net1.nerim.net [213.41.153.130]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71I9VFl010664 for ; Fri, 1 Aug 2003 11:09:32 -0700 Received: from zion.matrix ([2001:7a8:5982:1:209:5bff:fe1c:f0b8]) by cerbere with esmtp (Exim 4.20) id 19ieLF-0000M2-NW for netdev@oss.sgi.com; Fri, 01 Aug 2003 20:09:25 +0200 Received: from localhost ([::1] helo=zion.nerim.net) by zion.matrix with esmtp (Exim 4.20) id 19ieLF-0007xu-Db for netdev@oss.sgi.com; Fri, 01 Aug 2003 20:09:25 +0200 To: Linux Networking List Subject: [PATCH 2] 2.4.x USAGI unused variables in mipv6_ha_ipsec.c From: "Daniel 'NebuchadnezzaR' Dehennin" Organisation: CaLviX Date: Fri, 01 Aug 2003 20:09:25 +0200 Message-ID: <87fzklgsh6.fsf@zion.matrix> User-Agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4428 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nebuchadnezzar@nerim.net Precedence: bulk X-list: netdev Hello again ;-), A patch to remove unused variables : mipv6_ha_ipsec.c: In function `__mipv6_mn_change_tunnel_ipsec_by_proto': mipv6_ha_ipsec.c:216: warning: unused variable `ret' mipv6_ha_ipsec.c: In function `__mipv6_ha_change_tunnel_ipsec_by_proto': mipv6_ha_ipsec.c:338: warning: unused variable `ret' See you. --- linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c.orig 2003-08-01 20:06:15.000000000 +0200 +++ linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c 2003-08-01 20:06:42.000000000 +0200 @@ -213,7 +213,6 @@ int __mipv6_mn_change_tunnel_ipsec_by_pr struct in6_addr dst; struct in6_addr src; struct in6_addr *coa = &entry->coa; - int ret = 0; /* * Phase 1: Change the following SA/SPD @@ -335,7 +334,6 @@ int __mipv6_ha_change_tunnel_ipsec_by_pr struct in6_addr dst; struct in6_addr src; struct in6_addr *coa = &entry->coa; - int ret = 0; /* * Phase 1: Change the following SA/SPD -- Daniel 'NebuchadnezzaR' Dehennin From willy@www.linux.org.uk Fri Aug 1 12:17:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 12:17:20 -0700 (PDT) Received: from www.linux.org.uk (IDENT:dWRqvOFfILtpyOjLUmme1m8+W8uXBtM7@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71JH9Fl005257 for ; Fri, 1 Aug 2003 12:17:10 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19ic7M-0004BK-Kb; Fri, 01 Aug 2003 16:46:56 +0100 Date: Fri, 1 Aug 2003 16:46:56 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801154021.GA7696@gtf.org> User-Agent: Mutt/1.4.1i X-archive-position: 4429 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > Comments: > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar DaveM disagreed with that... > * I still do not see the need to change a simple storage of a constant > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg > dump len, eeprom dump len, nic-specific stats len, self-test len). > Internal kernel code that needs this information is always a slow path > anyway, so just call the ->get_drvinfo hook internally. slow path, sure, but increased stack usage. it's a tradeoff, and this way feels more clean to me. > * I prefer not to add '#include ' to ethtool.h That means that any code which includes ethtool.h has to include types.h first (either implicitly or explicitly). The rule so far has been that header files should call out their dependencies explictly with an include of the appropriate file. So why *don't* you want it? -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From garzik@gtf.org Fri Aug 1 12:44:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 12:44:38 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71JiRFl006741 for ; Fri, 1 Aug 2003 12:44:28 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 458486698; Fri, 1 Aug 2003 15:44:22 -0400 (EDT) Date: Fri, 1 Aug 2003 15:44:20 -0400 From: Jeff Garzik To: torvalds@osdl.org Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: [BK PATCHES] 2.6.x net driver merges Message-ID: <20030801194420.GD3571@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4430 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Linus, please do a bk pull bk://gkernel.bkbits.net/net-drivers-2.5 Others may download the patch from ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.6/2.6.0-test2-netdrvr1.patch.bz2 This will update the following files: Documentation/networking/bonding.txt | 343 ++++++++++++++++++++++++----------- Documentation/networking/ifenslave.c | 3 drivers/net/arcnet/com20020-isa.c | 2 drivers/net/tokenring/ibmtr.c | 3 drivers/net/wireless/airo.c | 104 +++++++--- 5 files changed, 309 insertions(+), 146 deletions(-) through these ChangeSets: (03/08/01 1.1547.8.10) Cset exclude: jgarzik@redhat.com|ChangeSet|20030731201437|53548 My fix was wrong, and, mainline now has a better fix. (03/07/31 1.1547.8.9) [tokenring ibmtr_cs] fix build, due to missing ibmtr.c build Note: Better fix is needed. Contributed by Mike Phillips. (03/07/31 1.1547.8.8) [arcnet com20020-isa] fix build broken by lack of ->owner (03/07/31 1.1547.8.7) [netdrvr bonding] fix ifenslave build on ia64 Forward port from 2.4. (03/07/31 1.1547.8.6) [netdrvr bonding] update docs (03/07/29 1.1547.8.5) [wireless airo] adds support for noise level reporting (if available) (03/07/29 1.1547.8.4) [wireless airo] makes the card passive when entering monitor mode (03/07/29 1.1547.8.3) [wireless airo] eliminate infinite loop makes sure a possible (never happened, but just in case) infinite loop in the transmission code terminates. (03/07/29 1.1547.8.2) [wireless airo] safer shutdown sequence changes the card shutdown sequence to a safer one (03/07/29 1.1547.8.1) [wireless airo] fix Tx race From davem@redhat.com Fri Aug 1 13:24:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 13:24:39 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71KOUFl008291 for ; Fri, 1 Aug 2003 13:24:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id NAA07798; Fri, 1 Aug 2003 13:20:37 -0700 Date: Fri, 1 Aug 2003 13:20:37 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801132037.3f3542ae.davem@redhat.com> In-Reply-To: <20030801162536.GA18574@gtf.org> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4431 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 1 Aug 2003 12:25:36 -0400 Jeff Garzik wrote: > On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > > Comments: > > > > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > > > DaveM disagreed with that... > > It's standard netdevice.h practice, and, he didn't disagree w/ my > rebuttal. > > It is needed. Absolutely not, it makes no sense whatsoever to have this. Jeff, stop and think. The whole _POINT_ of these ops are to avoid duplicated code. If someone is absolutely adament about supporting kernels without ops support they should not support it at all. The point is to avoid code duplication, but what you suggest can only be used to keep the duplicated code around "just in case". This makes exactly no sense at all, it severs only to defeat the whole purpose of the change in the first place. I totally am against making an ifdef test available for this, it can only result in illogical things being done by driver maintainers. From jgarzik@pobox.com Fri Aug 1 15:35:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:35:49 -0700 (PDT) Received: from www.linux.org.uk (IDENT:rg5SenHWUdrUH12rD8TwSji2gubj2u02@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71MZiFl015583 for ; Fri, 1 Aug 2003 15:35:44 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iiUw-0000tp-Gp; Fri, 01 Aug 2003 23:35:42 +0100 Message-ID: <3F2AEB33.9050506@pobox.com> Date: Fri, 01 Aug 2003 18:35:31 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> In-Reply-To: <3F2AE91D.5090705@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4432 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > It's an explicit goal to avoid changing the driver API in such a way > that there is a remotely sane path to supporting older kernels. I, of course, meant the exact opposite here :) We want to provide a sane, ifdef-free path to kcompat, where feasible. From davem@redhat.com Fri Aug 1 15:36:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:37:00 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71MatFl015832 for ; Fri, 1 Aug 2003 15:36:55 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id PAA08138; Fri, 1 Aug 2003 15:32:55 -0700 Date: Fri, 1 Aug 2003 15:32:55 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801153255.204baf66.davem@redhat.com> In-Reply-To: <3F2AE91D.5090705@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4433 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 18:26:37 -0400 Jeff Garzik wrote: > Strangely enough, creating a SET_ETHTOOL_OPS() macro (or > netif_ethtool_ops or pick your name) reduces ifdefs. And then we'll have all of these functions present in the driver, unused, and we'll get tons of warning from gcc. The duplication of code is still there, and this is the main point. > I feel that I've helped shepherd the net driver and PCI APIs to maintain > something fairly interesting: It's not interesting in this case. > It's an explicit goal to avoid changing the driver API in such a way > that there is a remotely sane path to supporting older kernels. This enhancement we're talking about basically has no value unless you accept an appearance of breakage in this particular area. You can't get rid of the duplicated code without accepting that you will have seperate 2.6.x and 2.4.x strains of your driver. If you aren't willing to accept seperate strains of your driver, you simply don't use netdev_ops. It is the end of the conversation. > the few things that is not easily work-around-able is new additions to > existing structures (which wouldn't exist in older kernels). That's > what SET_ETHTOOL_OPS would wrap, while also providing a trigger for > generic compat glue. What gets rid of the static functions that do the work when SET_ETHTOOL_OPS() is a nop? I do not accept a scheme where the functions stay there in the driver anyways. All you seem to be talking about is a compat library which provides netdev_ops in library form or something silly like that. > This (IMO) feature continually saves me real time I don't argue that, just don't use netdev_ops in drivers you wish to keep doing this with :-) Look at drivers/net/acenic.c, that's similar to what your drivers will begin to look like if you don't start accepting a disconnect in certain areas. From davem@redhat.com Fri Aug 1 15:38:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:38:36 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71McWFl016352 for ; Fri, 1 Aug 2003 15:38:32 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id PAA08157; Fri, 1 Aug 2003 15:34:39 -0700 Date: Fri, 1 Aug 2003 15:34:39 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801153439.4a324c36.davem@redhat.com> In-Reply-To: <3F2AEB33.9050506@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4434 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 18:35:31 -0400 Jeff Garzik wrote: > We want to provide a sane, ifdef-free path to kcompat, where feasible. I don't believe it's possible with netdev_ops, without undoing the entire purpose of what netdev_ops is trying to accomplish (elimination of code duplication). Show me, in code not words, how you are able to accomplish this with SET_NETDEV_OPS() or whatever. I will not read english text describing the scheme, I will read only code :) From greearb@candelatech.com Fri Aug 1 15:55:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:55:07 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71Mt1Fl017421 for ; Fri, 1 Aug 2003 15:55:02 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h71Msttf013879 for ; Fri, 1 Aug 2003 15:54:56 -0700 Message-ID: <3F2AEFBF.3040604@candelatech.com> Date: Fri, 01 Aug 2003 15:54:55 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: 2.4.21: bug report for tg3: tx lockup when changing MTU Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4435 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev I just noticed that if you change the MTU of a tg3 NIC, it will not work untill you ifdown/ifup it. This problem is repeatable on tg3, and does not happen with the e1000 driver/cards. I am setting the MTU via an ioctl call, not via ifconfig or something like that. When the tg3 is locked up, I see this on the console: Aug 1 15:05:44 demo2 kernel: NETDEV WATCHDOG: eth5: transmit timed out Aug 1 15:05:44 demo2 kernel: tg3: eth5: transmit timed out, resetting Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2 Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2 Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2 Aug 1 15:05:54 demo2 kernel: NETDEV WATCHDOG: eth5: transmit timed out Aug 1 15:05:54 demo2 kernel: tg3: eth5: transmit timed out, resetting Aug 1 15:05:54 demo2 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2 Aug 1 15:05:54 demo2 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2 ... Kernel is 2.4.21 + custom patches (which should not affect tg3). lspci says the NIC is: Altima AC9100 (rev 15) I will be happy to provide more information as needed. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jgarzik@pobox.com Fri Aug 1 16:01:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:01:40 -0700 (PDT) Received: from www.linux.org.uk (IDENT:Zn4088h0c9junvtyMz48dB3BmWWWI3H8@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71N1YFl018056 for ; Fri, 1 Aug 2003 16:01:35 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iitw-000162-PQ; Sat, 02 Aug 2003 00:01:32 +0100 Message-ID: <3F2AF141.2010308@pobox.com> Date: Fri, 01 Aug 2003 19:01:21 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> In-Reply-To: <20030801153255.204baf66.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4436 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 18:26:37 -0400 > Jeff Garzik wrote: > > >>Strangely enough, creating a SET_ETHTOOL_OPS() macro (or >>netif_ethtool_ops or pick your name) reduces ifdefs. > > > And then we'll have all of these functions present in > the driver, unused, and we'll get tons of warning from > gcc. > > The duplication of code is still there, and this is the > main point. Not correct: there is nothing unused, there are no warnings, in either the in-kernel case or the older-kernel case. Look at kcompat. That is code that is working, and producing the 2.4/2.6-ready vendor drivers I spoke of. I'm apparently not communicating the design that exists in kcompat, if you think this. The design is: code for 2.6, and it magically works in 2.4 It's a back-compat system that is so good you don't even know it's there. It's completely invisible to the mainline kernel -- as it should be -- presuming that one pays attention to subtle API change effects. Do you see yet how there is no code duplication, no ifdefs, no warnings about unused functions? That is the key point of the whole design, and key to the thread of discussion here. > You can't get rid of the duplicated code without accepting that you > will have seperate 2.6.x and 2.4.x strains of your driver. > > If you aren't willing to accept seperate strains of your driver, you > simply don't use netdev_ops. Look at kcompat. That is real, working code that demonstrates the approach. >>the few things that is not easily work-around-able is new additions to >>existing structures (which wouldn't exist in older kernels). That's >>what SET_ETHTOOL_OPS would wrap, while also providing a trigger for >>generic compat glue. > > > What gets rid of the static functions that do the work when > SET_ETHTOOL_OPS() is a nop? SET_ETHTOOL_OPS is never a no-op. The back-compat form of SET_ETHTOOL_OPS registers the ethtool_ops pointer in storage for later use. A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not included in the kernel -- does the rest, calling kcompat's backported net/core/ethtool.c, which in turn calls the ethtool_ops hooks in the driver. Making the kcompat'd net driver ready for 2.6 would then involve simply deleting one line. That's why there is no code duplication or unused driver code. Jeff From davem@redhat.com Fri Aug 1 16:05:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:05:37 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71N5XFl018589 for ; Fri, 1 Aug 2003 16:05:33 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08249; Fri, 1 Aug 2003 16:01:36 -0700 Date: Fri, 1 Aug 2003 16:01:36 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801160136.3342c5cc.davem@redhat.com> In-Reply-To: <3F2AF141.2010308@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4437 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:01:21 -0400 Jeff Garzik wrote: > A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not > included in the kernel -- does the rest, I don't understand. Where does this DO_ETHTOOL_OPS macro come from? Is it defined by kcompat? If so, how will drivers in vanilla 2.4.x trees end up with the DO_ETHTOOL_OPS define? From davem@redhat.com Fri Aug 1 16:12:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:12:52 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NCnFl019319 for ; Fri, 1 Aug 2003 16:12:49 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08283; Fri, 1 Aug 2003 16:08:57 -0700 Date: Fri, 1 Aug 2003 16:08:57 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801160857.32ebbf22.davem@redhat.com> In-Reply-To: <3F2AF32F.7090201@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4438 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:09:35 -0400 Jeff Garzik wrote: > #define SET_ETHTOOL_OPS kcompat_set_ethtool_ops > > #define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ Where does kcompat_set_ethtool_ops store the pointer if it does not exist in struct netdevice? From jgarzik@pobox.com Fri Aug 1 16:18:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:18:15 -0700 (PDT) Received: from www.linux.org.uk (IDENT:iDc7ycOqp9NNPy2+dMfDWg8UaR/Y+gOS@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NIAFl019905 for ; Fri, 1 Aug 2003 16:18:10 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijA0-0001GK-DD; Sat, 02 Aug 2003 00:18:08 +0100 Message-ID: <3F2AF525.3000605@pobox.com> Date: Fri, 01 Aug 2003 19:17:57 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> In-Reply-To: <20030801160136.3342c5cc.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4439 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:01:21 -0400 > Jeff Garzik wrote: > > >>A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not >>included in the kernel -- does the rest, > > > I don't understand. > > Where does this DO_ETHTOOL_OPS macro come from? Is it defined > by kcompat? If so, how will drivers in vanilla 2.4.x trees end > up with the DO_ETHTOOL_OPS define? If one wishes to implement kcompat design ("it looks like a 2.6 driver"), then you have two needs over and above Matthew's current ethtool_ops patch: (1) naked struct deref of netdev->ethtool_ops will break immediately on older kernels, and (2) to avoid code duplication, you need to insert a call to kcompat's do_ethtool_handling_the_old_way... i.e. basically what net/core/ethtool.c does now. Problem #1 is solved with a wrapper macro that disguises the naked struct deref to ->ethtool_ops. Problem #2 is solved by adding a call to DO_ETHTOOL_OPS macro in a driver's ->do_ioctl handler. So, with those two minor changes, a 2.6 driver will work on an older kernel. To answer your question above, DO_ETHTOOL_OPS can occur one of two ways: (1) my preferred approach, define a no-op DO_ETHTOOL_OPS macro in-kernel -- but I did not think this would get accepted, so I chose (2) DO_ETHTOOL_OPS exists entirely in kcompat, and people submitting kcompat users to mainline would simply delete the one line calling DO_ETHTOOL_OPS. Solution #2 chooses to create a tiny bit more merge-to-mainline pain, but also keeps the mainline kernel drivers more clean. Jeff From davem@redhat.com Fri Aug 1 16:23:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:23:34 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NNVFl020521 for ; Fri, 1 Aug 2003 16:23:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08330; Fri, 1 Aug 2003 16:19:38 -0700 Date: Fri, 1 Aug 2003 16:19:37 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801161937.1d9a7126.davem@redhat.com> In-Reply-To: <3F2AF525.3000605@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4440 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:17:57 -0400 Jeff Garzik wrote: > Solution #2 chooses to create a tiny bit more > merge-to-mainline pain, but also keeps the mainline kernel drivers more > clean. You don't need DO_ETHTOOL_OPS and thus the merge-to-mainline pain at all if you do something like: 1) SET_ETHDEV_OPS() also overrides the ->do_ioctl() setting to a kcompat_netdev_ioctl() one, but remembers the original pointer somewhere. 2) kcompat_netdev_ioctl() does the things DO_ETHTOOL_OPS would have done, failing that it calls the saved ->do_ioctl() pointer. From jgarzik@pobox.com Fri Aug 1 16:35:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:35:39 -0700 (PDT) Received: from www.linux.org.uk (IDENT:J3yO/Z/hGYogdXpUd6WP37Z2oMWhXKE3@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NZWFl021452 for ; Fri, 1 Aug 2003 16:35:33 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijQp-0001NP-4x; Sat, 02 Aug 2003 00:35:31 +0100 Message-ID: <3F2AF938.7050608@pobox.com> Date: Fri, 01 Aug 2003 19:35:20 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> <20030801160857.32ebbf22.davem@redhat.com> In-Reply-To: <20030801160857.32ebbf22.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4441 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:09:35 -0400 > Jeff Garzik wrote: > > >>#define SET_ETHTOOL_OPS kcompat_set_ethtool_ops >> >>#define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ > > > Where does kcompat_set_ethtool_ops store the pointer if > it does not exist in struct netdevice? Inside an area allocated by the kcompat lib. SET_ETHTOOL_OPS takes 'struct net_device *' and 'struct ethtool_ops *' arguments, so it simply needs to create a lookup list/table somewhere. You keep asking for code, read kcompat :) kcompat_set_ethtool_ops has exactly the same task as the 2.2.x-era backcompat implementation of pci_{get,set}_drvdata. The perfect back-porting/back-compat system would magically make all Linus-tree drivers work without any change on older kernels. I really think the kcompat design is as close as you can come to that. Here is a linux-kernel-friendly version of the kcompat design: "naked struct derefs hurt. otherwise, happy hacking!" And further, experience shows that the number of naked struct derefs that matter is fairly small. (Another less-common area that hurts besides naked-struct-deref is function return type, which is why Linus created irqreturn_t) Jeff From davem@redhat.com Fri Aug 1 16:38:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:38:19 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NcAFl021877 for ; Fri, 1 Aug 2003 16:38:10 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08362; Fri, 1 Aug 2003 16:34:15 -0700 Date: Fri, 1 Aug 2003 16:34:15 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801163415.1c3fd6fb.davem@redhat.com> In-Reply-To: <3F2AF938.7050608@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> <20030801160857.32ebbf22.davem@redhat.com> <3F2AF938.7050608@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4442 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:35:20 -0400 Jeff Garzik wrote: > Inside an area allocated by the kcompat lib. SET_ETHTOOL_OPS takes > 'struct net_device *' and 'struct ethtool_ops *' arguments, so it simply > needs to create a lookup list/table somewhere. Ok ok ok, we're converging :-) Please just comment on my other email suggesting a way to do away with DO_ETHTOOL_OPS. I'm OK with a SET_ETHTOOL_OPS() macro. From davem@redhat.com Fri Aug 1 16:47:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:47:29 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NlMFl022685 for ; Fri, 1 Aug 2003 16:47:22 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08392; Fri, 1 Aug 2003 16:43:29 -0700 Date: Fri, 1 Aug 2003 16:43:28 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801164328.5b5bc145.davem@redhat.com> In-Reply-To: <3F2AFAF4.3040604@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> <20030801161937.1d9a7126.davem@redhat.com> <3F2AFAF4.3040604@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4443 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:42:44 -0400 Jeff Garzik wrote: > Still need the boring and obvious definition of SET_ETHTOOL_OPS in > mainline, though. Like I said, I've got no problem with that part. From jgarzik@pobox.com Fri Aug 1 16:58:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:58:06 -0700 (PDT) Received: from www.linux.org.uk (IDENT:SvMafmN//vZSJhMjuzBvSDjdG3jdClZ3@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NvwFl023495 for ; Fri, 1 Aug 2003 16:57:59 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijmW-0001WQ-Qd; Sat, 02 Aug 2003 00:57:56 +0100 Message-ID: <3F2AFE7A.10203@pobox.com> Date: Fri, 01 Aug 2003 19:57:46 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Ben Greear CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> In-Reply-To: <3F2AEFBF.3040604@candelatech.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4444 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Ben Greear wrote: > I just noticed that if you change the MTU of a tg3 NIC, it will not work > untill you ifdown/ifup it. This problem is repeatable on tg3, and does not > happen with the e1000 driver/cards. > > I am setting the MTU via an ioctl call, not via ifconfig or something like > that. Can you provide the ioctl call info, so I can reproduce? And, are you changing MTU when the interface is up or down? From jgarzik@pobox.com Fri Aug 1 17:00:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:00:32 -0700 (PDT) Received: from www.linux.org.uk (IDENT:mbLF7vnQT6LUjTsH4zlXT2XhjWraHjZQ@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7200PFl023951 for ; Fri, 1 Aug 2003 17:00:26 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ij1u-0001Be-Az; Sat, 02 Aug 2003 00:09:46 +0100 Message-ID: <3F2AF32F.7090201@pobox.com> Date: Fri, 01 Aug 2003 19:09:35 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> In-Reply-To: <20030801153439.4a324c36.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4446 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 18:35:31 -0400 > Jeff Garzik wrote: > > >>We want to provide a sane, ifdef-free path to kcompat, where feasible. > > > I don't believe it's possible with netdev_ops, without > undoing the entire purpose of what netdev_ops is trying > to accomplish (elimination of code duplication). > > Show me, in code not words, how you are able to accomplish > this with SET_NETDEV_OPS() or whatever. I will not read > english text describing the scheme, I will read only code :) Read kcompat. Then: #define SET_ETHTOOL_OPS kcompat_set_ethtool_ops #define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ I would define both of these in Matthew's patch, but one only _needs_ to define SET_ETHTOOL_OPS, so I pushed for the latter course. So why is SET_ETHTOOL_OPS needed? It covered up the one place It intentionally follows the same design as SET_MODULE_OWNER, and for the same purpose: hiding what would otherwise be a naked struct deref to a struct member that does not exist on an older kernel. Hiding naked struct derefs is also the reason I created pci_{get,drv}_drvdata, pci_resource_*, etc. Back compat is really a big syntactic sugar game, and naked struct derefs are really the only big thorn in the side. Everything else can be beaten down with syntactic sugar behind the scenes, that never ever gets merged into the upstream kernel. Jeff From jgarzik@pobox.com Fri Aug 1 17:00:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:00:31 -0700 (PDT) Received: from www.linux.org.uk (IDENT:allKiZwinkLTubXzBheWp6K1ooOLy/4T@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7200NFl023944 for ; Fri, 1 Aug 2003 17:00:23 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iiMK-0000pf-0I; Fri, 01 Aug 2003 23:26:48 +0100 Message-ID: <3F2AE91D.5090705@pobox.com> Date: Fri, 01 Aug 2003 18:26:37 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> In-Reply-To: <20030801132037.3f3542ae.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4445 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > The whole _POINT_ of these ops are to avoid duplicated code. > If someone is absolutely adament about supporting kernels > without ops support they should not support it at all. > > The point is to avoid code duplication, but what you suggest can only > be used to keep the duplicated code around "just in case". This makes > exactly no sense at all, it severs only to defeat the whole purpose > of the change in the first place. > > I totally am against making an ifdef test available for this, it can > only result in illogical things being done by driver maintainers. Strangely enough, creating a SET_ETHTOOL_OPS() macro (or netif_ethtool_ops or pick your name) reduces ifdefs. I feel that I've helped shepherd the net driver and PCI APIs to maintain something fairly interesting: a driver API that [for the most part...] allows one to write a driver completely without compatibility ifdefs, and ancient-kernel junk. When married with a compat glue lib outside the tree, the same ifdef-free driver works on older kernels. It's an explicit goal to avoid changing the driver API in such a way that there is a remotely sane path to supporting older kernels. One of the few things that is not easily work-around-able is new additions to existing structures (which wouldn't exist in older kernels). That's what SET_ETHTOOL_OPS would wrap, while also providing a trigger for generic compat glue. This trigger is what _reduces_ code duplication. Given such a trigger, a generic library can implement compat code on older kernels. The drivers remain ifdef-free and compat-junk-free. This is method used by the kcompat toolkit (http://sf.net/projects/gkernel/). This (IMO) feature continually saves me real time, again and again, when merging a new net driver into the kernel. It saves me time debugging a driver in both 2.4 and 2.6. The time savings is in the minimization (is that a word?) of changes across kernel versions, and this particular ethtool_ops change will be a thorn in particular. This ethtool_ops change _is_ trivially made backward-compatible, with a simple macro. Look at the future, where vendors are submitting 2.6-ready net drivers, because we made it easier for them to support their existing platform. Over and above the time savings, vendors _will_ start submitting drivers that actually look like Linux drivers. This has already started happening :) Just today I received a Via-rhine gbit driver (GPL'd) at Red Hat, which I am preparing to merge into the kernel. After removing the awful Hungarian notation and silly procfs apis, the driver's actually pretty close to a mergeable driver. It uses the kcompat stuff, and as such isn't full of ifdefs and typical vendor cpp maze. So, for the benefits of saving me real wall-clock hours, and pushing the vendors to create ready-for-the-kernel drivers more often, the cost is a simple one-line wrapper macro that in-kernel drivers would rarely use. In the long run, I'm trying to use and abuse Intel as an example for other vendors to follow (using netdev@, splitting up patches, etc.), and push the driver maintenance load onto the vendors (where they're willing, etc., like Intel). If vendors are willing to respond to feedback and follow standard linux-kernel email development, I'm more than happy for them to become a learned funnel of patches to netdev for review :) This kcompat strategy -- back-compat without ifdefs -- goes a long way towards that, and SET_ETHTOOL_OPS is a big piece of that puzzle right now. Jeff From greearb@candelatech.com Fri Aug 1 17:24:16 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:24:54 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h720OFFl025687 for ; Fri, 1 Aug 2003 17:24:16 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h720O0tf025224; Fri, 1 Aug 2003 17:24:10 -0700 Message-ID: <3F2B04A0.9030101@candelatech.com> Date: Fri, 01 Aug 2003 17:24:00 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> <3F2AFE7A.10203@pobox.com> In-Reply-To: <3F2AFE7A.10203@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4447 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > Ben Greear wrote: > >> I just noticed that if you change the MTU of a tg3 NIC, it will not work >> untill you ifdown/ifup it. This problem is repeatable on tg3, and >> does not >> happen with the e1000 driver/cards. >> >> I am setting the MTU via an ioctl call, not via ifconfig or something >> like >> that. > > > > Can you provide the ioctl call info, so I can reproduce? > > And, are you changing MTU when the interface is up or down? Interface is up and transmitting/receiving pkts at the time. I just reproduced it with commands below. It is probably a race, so not sure that either of these will always fail. Running about 10kpps rx+tx. Was sending pktgen (UDP) traffic of fixed length, so the actual transmitted packet sizes remains the same in this case. # MTU is at 1500 ifconfig eth5 mtu 4096 #worked ifconfig eth5 mtu 4000 # failed. -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jgarzik@pobox.com Fri Aug 1 18:07:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 18:07:05 -0700 (PDT) Received: from www.linux.org.uk (IDENT:w1RYSqFLki8rczBJvztL9jiQ2bgDrUfw@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7216xFl028005 for ; Fri, 1 Aug 2003 18:07:00 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijXz-0001Q9-5p; Sat, 02 Aug 2003 00:42:55 +0100 Message-ID: <3F2AFAF4.3040604@pobox.com> Date: Fri, 01 Aug 2003 19:42:44 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> <20030801161937.1d9a7126.davem@redhat.com> In-Reply-To: <20030801161937.1d9a7126.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4448 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:17:57 -0400 > Jeff Garzik wrote: > > >>Solution #2 chooses to create a tiny bit more >>merge-to-mainline pain, but also keeps the mainline kernel drivers more >>clean. > > > You don't need DO_ETHTOOL_OPS and thus the merge-to-mainline pain > at all if you do something like: > > 1) SET_ETHDEV_OPS() also overrides the ->do_ioctl() setting to > a kcompat_netdev_ioctl() one, but remembers the original pointer > somewhere. > > 2) kcompat_netdev_ioctl() does the things DO_ETHTOOL_OPS would > have done, failing that it calls the saved ->do_ioctl() pointer. Certainly. That's a bit nicer than the back-compat gunk I was plotting, even. Still need the boring and obvious definition of SET_ETHTOOL_OPS in mainline, though. Jeff From takamiya@po.ntts.co.jp Fri Aug 1 19:59:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 19:59:20 -0700 (PDT) Received: from mail1.ics.ntts.co.jp (mail1.ics.ntts.co.jp [202.32.24.45]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h722xDFl001702 for ; Fri, 1 Aug 2003 19:59:16 -0700 Received: from mail26.silk.ntts.co.jp by mail1.ics.ntts.co.jp (8.9.3p2/3.7W-NTTSOFT-SUR2.0) id LAA11990 for ; Sat, 2 Aug 2003 11:59:12 +0900 (JST) (envelope-from takamiya@po.ntts.co.jp) Received: from daemon.inl.ntts.co.jp by mail26.silk.ntts.co.jp (8.11.7/3.7W-silk-4.6) id h722xB316268 for ; Sat, 2 Aug 2003 11:59:11 +0900 (JST) (envelope-from takamiya@po.ntts.co.jp) Received: (qmail 54448 invoked by alias); 2 Aug 2003 11:59:10 +0900 Received: (qmail 54428 invoked from network); 2 Aug 2003 11:59:10 +0900 Received: from localhost by localhost with SMTP; 2 Aug 2003 11:59:10 +0900 Date: Sat, 02 Aug 2003 11:59:09 +0900 (JST) Message-Id: <20030802.115909.576029077.takamiya@po.ntts.co.jp> To: nebuchadnezzar@nerim.net Cc: netdev@oss.sgi.com, takamiya@po.ntts.co.jp Subject: Re: [PATCH] 2.4.x USAGI mipv6_ha_ipsec From: Noriaki Takamiya In-Reply-To: <87n0etgt7w.fsf@zion.matrix> <87fzklgsh6.fsf@zion.matrix> References: <87n0etgt7w.fsf@zion.matrix> X-Face: +<)&j!Ce24nM@a.\f6TA,]^9Q76[_QN_[QR-(bT&>b40Oo[:`R(>b7!b-|q5k&.8CO[_Oh_ !9Nk0rikK70~?|08EFH|:]iF6pwPlnfEn-wo-voY:rP?%7p%cxjnbf'hglO'se&QwZN7/RVX!U7*P% cTV('HfHp+?g1+hx7\+J.W]G zYWv%LsDc X-Mailer: Mew version 3.2rc1 on XEmacs 21.4.8 (Honest Recruiter) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4449 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: takamiya@po.ntts.co.jp Precedence: bulk X-list: netdev Hi, >> Fri, 01 Aug 2003 19:53:23 +0200 >> [Subject: [PATCH] 2.4.x USAGI mipv6_ha_ipsec] >> "Daniel 'NebuchadnezzaR' Dehennin" wrote... nebuchadnezzar> I search for the definition of in6_ntop, it in include/linux/inet.h nebuchadnezzar> so I make that patch. >> Fri, 01 Aug 2003 20:09:25 +0200 >> [Subject: [PATCH 2] 2.4.x USAGI unused variables in mipv6_ha_ipsec.c] >> "Daniel 'NebuchadnezzaR' Dehennin" wrote... nebuchadnezzar> Hello again ;-), nebuchadnezzar> nebuchadnezzar> A patch to remove unused variables : Applied both fixes. Thakns. -- Noriaki Takamiya From akpm@osdl.org Sat Aug 2 01:12:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 01:12:16 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h728BxFl020572 for ; Sat, 2 Aug 2003 01:12:07 -0700 Received: from mnm (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h728BnI26175 for ; Sat, 2 Aug 2003 01:11:51 -0700 Date: Sat, 2 Aug 2003 01:12:48 -0700 From: Andrew Morton To: netdev@oss.sgi.com Subject: Fw: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key Message-Id: <20030802011248.6772c9cd.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4450 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Begin forwarded message: Date: Sat, 2 Aug 2003 01:01:24 -0700 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key http://bugme.osdl.org/show_bug.cgi?id=1030 Summary: racoon causes oops when implementing IPSec key Kernel Version: 2.6.0-test1 Status: NEW Severity: normal Owner: acme@conectiva.com.br Submitter: jsanchez@cs.ucf.edu Distribution: SuSE and LFS Hardware Environment: e100 cards Software Environment: ipsec-tools 0.2.2 Problem Description: I setkey with a policy to use esp and ah on each box. I start racoon on each box. I punch up a web page on one from the other. Insta-oops x 2. Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: c02bbd06 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010206 EIP is at memcpy+0x1e/0x39 eax: 00000018 ebx: f6fe8a00 ecx: 00000006 edx: 00000000 esi: 00000000 edi: 00000000 ebp: c0562520 esp: f6fb5ccc ds: 007b es: 007b ss:0068 Process racoon (pid: 418, threadinfo=f6fb4000 task=f6fbb300) Stack: Call Trace: xfrm_state_update pfkey_add parse_exthdrs pfkey_process pfkey_sendmsg sock_sendmsg verify_iovec sys_sendmsg sockfd_lookup sys_sendto sys_getsockname __pollwait update_process sys_send sys_socketcall syscall_call Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 89 d0 8b 74 24 02 8b <0>Kernel panic: Fatal exception in interrupt In interrupt handler = not syncing For some of the other numbers that didn't get copied, check 67.9.9.32/oops.jpg. Email me if its dead, which it will be after 20 august. Steps to reproduce: >From each box: #!setkey -f flush; spdflush; spdadd $this_box $other_box any -P out ipsec esp/transport//use ah/transport//use; spdadd $other_box $this_box any -P in ipsec esp/transport//use ah/transport//use; Set up racoon (the default config would probably work, here is the gist of mine) remote anonymous { exchange_mode main; my_identifier address; peers_identifier address; lifetime time 1 min; # sec,min,hour proposal { encryption_algorithm 3des; hash_algorithm sha1; authentication_method pre_shared_key ; dh_group 2; } } sainfo anonymous { lifetime time 20 min; encryption_algorithm 3des ; authentication_algorithm hmac_sha1; compression_algorithm deflate ; } Start racoon on each box. Open a new connection to cause a key exchange. Hit the reset button on each box. ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From davem@redhat.com Sat Aug 2 01:17:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 01:18:02 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h728HvFl020948 for ; Sat, 2 Aug 2003 01:17:58 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA09115; Sat, 2 Aug 2003 01:13:58 -0700 Date: Sat, 2 Aug 2003 01:13:58 -0700 From: "David S. Miller" To: Andrew Morton Cc: netdev@oss.sgi.com Subject: Re: Fw: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key Message-Id: <20030802011358.0524c88c.davem@redhat.com> In-Reply-To: <20030802011248.6772c9cd.akpm@osdl.org> References: <20030802011248.6772c9cd.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4451 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Old bug, fixed in current sources. From sascha@schumann.cx Sat Aug 2 02:44:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 02:44:55 -0700 (PDT) Received: from milton.schell.de (kdserv.de [217.160.72.35]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h729ijFl027701 for ; Sat, 2 Aug 2003 02:44:46 -0700 Received: (qmail 29266 invoked by uid 501); 2 Aug 2003 09:44:44 -0000 Received: from unknown (HELO eco.foo) (80.143.24.176) by kdserv.de with SMTP; 2 Aug 2003 09:44:44 -0000 Received: from localhost (localhost [127.0.0.1]) by eco.foo (Postfix) with ESMTP id 554E437045; Sat, 2 Aug 2003 11:44:43 +0200 (CEST) Date: Sat, 2 Aug 2003 11:44:43 +0200 (CEST) From: Sascha Schumann X-X-Sender: sas@eco.foo To: Ben Greear Cc: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU In-Reply-To: <3F2AEFBF.3040604@candelatech.com> Message-ID: References: <3F2AEFBF.3040604@candelatech.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4452 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sascha@schumann.cx Precedence: bulk X-list: netdev > Kernel is 2.4.21 + custom patches (which should not affect tg3). > > lspci says the NIC is: Altima AC9100 (rev 15) [1] says that the AC9100 based Netgear GA302T cards don't support jumbo frames. I'm seeing regular lockups once packets larger than 1500bytes flow through the NIC. It would be cool though if this turned out to be a driver limitation and not a (crippled) chipset issue. [1] http://www.google.de/search?q=cache:y_kVF_dR3TkJ:www.lanshop.co.uk/html/ga302tq.htm+netgear+ga302t+jumbo+frames&hl=de&ie=UTF-8 - Sascha From daniel.ritz@gmx.ch Sat Aug 2 04:53:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 04:54:03 -0700 (PDT) Received: from ritz.dnsalias.org (dclient217-162-108-200.hispeed.ch [217.162.108.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72BrsFl002495 for ; Sat, 2 Aug 2003 04:53:56 -0700 Received: from toshba.local (toshba.local [192.168.100.12]) by ritz.dnsalias.org (Postfix) with ESMTP id C83ED4FD7F; Sat, 2 Aug 2003 13:55:45 +0200 (CEST) From: Daniel Ritz To: "David S. Miller" Subject: [PATCH 2.6] Fix IPv6 esp mem leak in esp6_input Date: Sat, 2 Aug 2003 13:50:23 +0200 User-Agent: KMail/1.5.2 Cc: linux-net , "linux-netdev" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200308021350.23342.daniel.ritz@gmx.ch> X-archive-position: 4453 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: daniel.ritz@gmx.ch Precedence: bulk X-list: netdev fixes a mem leak in esp6_input() in the error paths. and return -ENOMEM, not -EINVAL when out of memory. against 2.6.0-test2-bk --- 1.19/net/ipv6/esp6.c Mon Jul 21 02:46:12 2003 +++ edited/net/ipv6/esp6.c Sat Aug 2 13:02:11 2003 @@ -200,18 +200,24 @@ int hdr_len = skb->h.raw - skb->nh.raw; int nfrags; - u8 ret_nexthdr = 0; unsigned char *tmp_hdr = NULL; + int ret = 0; - if (!pskb_may_pull(skb, sizeof(struct ipv6_esp_hdr))) - goto out; + if (!pskb_may_pull(skb, sizeof(struct ipv6_esp_hdr))) { + ret = -EINVAL; + goto out_nofree; + } - if (elen <= 0 || (elen & (blksize-1))) - goto out; + if (elen <= 0 || (elen & (blksize-1))) { + ret = -EINVAL; + goto out_nofree; + } tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); - if (!tmp_hdr) - goto out; + if (!tmp_hdr) { + ret = -ENOMEM; + goto out_nofree; + } memcpy(tmp_hdr, skb->nh.raw, hdr_len); /* If integrity check is required, do this. */ @@ -226,12 +232,15 @@ if (unlikely(memcmp(sum, sum1, alen))) { x->stats.integrity_failed++; + ret = -EINVAL; goto out; } } - if ((nfrags = skb_cow_data(skb, 0, &trailer)) < 0) + if ((nfrags = skb_cow_data(skb, 0, &trailer)) < 0) { + ret = -EINVAL; goto out; + } skb->ip_summed = CHECKSUM_NONE; @@ -251,8 +260,10 @@ if (unlikely(nfrags > MAX_SG_ONSTACK)) { sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC); - if (!sg) + if (!sg) { + ret = -ENOMEM; goto out; + } } skb_to_sgvec(skb, sg, sizeof(struct ipv6_esp_hdr) + esp->conf.ivlen, elen); crypto_cipher_decrypt(esp->conf.tfm, sg, sg, elen); @@ -267,6 +278,7 @@ if (net_ratelimit()) { printk(KERN_WARNING "ipsec esp packet is garbage padlen=%d, elen=%d\n", padlen+2, elen); } + ret = -EINVAL; goto out; } /* ... check padding bits here. Silly. :-) */ @@ -277,13 +289,13 @@ memcpy(skb->nh.raw, tmp_hdr, hdr_len); skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); ip6_find_1stfragopt(skb, &prevhdr); - ret_nexthdr = *prevhdr = nexthdr[1]; + ret = *prevhdr = nexthdr[1]; } - kfree(tmp_hdr); - return ret_nexthdr; out: - return -EINVAL; + kfree(tmp_hdr); +out_nofree: + return ret; } static u32 esp6_get_max_size(struct xfrm_state *x, int mtu) From daniel.ritz@gmx.ch Sat Aug 2 08:46:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 08:46:49 -0700 (PDT) Received: from ritz.dnsalias.org (dclient217-162-108-200.hispeed.ch [217.162.108.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72FkhFl016663 for ; Sat, 2 Aug 2003 08:46:44 -0700 Received: from toshba.local (toshba.local [192.168.100.12]) by ritz.dnsalias.org (Postfix) with ESMTP id C3B4C4FD7F; Sat, 2 Aug 2003 17:48:35 +0200 (CEST) From: Daniel Ritz To: Jeff Garzik Subject: [PATCH] fix airo memory leak Date: Sat, 2 Aug 2003 17:43:12 +0200 User-Agent: KMail/1.5.2 Cc: linux-net , "linux-netdev" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200308021743.12635.daniel.ritz@gmx.ch> X-archive-position: 4454 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: daniel.ritz@gmx.ch Precedence: bulk X-list: netdev fixes a memory leak: memory for the airo_devices list is allocated but never freed. against 2.6.0-test2-bk, but should apply to 2.4 as well... --- 1.54/drivers/net/wireless/airo.c Sun Jul 20 05:17:02 2003 +++ edited/drivers/net/wireless/airo.c Sat Aug 2 17:33:24 2003 @@ -4038,12 +4038,23 @@ return 0; } -static void del_airo_dev( struct net_device *dev ) { - struct net_device_list **p = &airo_devices; - while( *p && ( (*p)->dev != dev ) ) - p = &(*p)->next; - if ( *p && (*p)->dev == dev ) - *p = (*p)->next; +static void del_airo_dev(struct net_device *dev) +{ + struct net_device_list *this = airo_devices, *prev = NULL; + + while (this) { + if (this->dev == dev) { + if (prev) + prev->next = this->next; + else + airo_devices = this->next; + kfree(this); + break; + } + + prev = this; + this = this->next; + } } #ifdef CONFIG_PCI From werner@almesberger.net Sat Aug 2 10:04:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 10:05:08 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72H4uFl022817 for ; Sat, 2 Aug 2003 10:04:56 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72H4oG24090; Sat, 2 Aug 2003 10:04:50 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72H4i330124; Sat, 2 Aug 2003 14:04:44 -0300 Date: Sat, 2 Aug 2003 14:04:44 -0300 From: Werner Almesberger To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: TOE brain dump Message-ID: <20030802140444.E5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-archive-position: 4455 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev At OLS, there was a bit of discussion on (true and false *) TOEs (TCP Offload Engines). In the course of this discussion, I've suggested what might be a novel approach, so in case this is a good idea, I'd like to dump my thoughts on it, before someone tries to patent my ideas. (Most likely, some of this has already been done or tried elsewhere, but it can't hurt to try to err on the safe side.) (*) The InfiniBand people unfortunately call also their TCP/IP bypass "TOE" (for which they promptly get shouted down, every time they use that word). This is misleading, because there is no TCP that's getting offloaded, but TCP is simply never done. I would consider it to be more accurate to view this as a separate networking technology, with semantics different from TCP/IP, similar to ATM and AAL5. While I'm not entirely convinced about the usefulness of TOE in all the cases it's been suggested for, I can see value in certain areas, e.g. when TCP per-packet overhead becomes an issue. However, I consider the approach of putting a new or heavily modified stack, which duplicates a considerable amount of the functionality in the main kernel, on a separate piece of hardware questionable at best. Some of the issues: - if this stack is closed source or generally hard to modify, security fixes will be slowed down - if this stack is closed source or generally hard to modify, TOE will not be available to projects modifying the stack, e.g. any of the research projects trying to make TCP work at gigabit speeds - this stack either needs to implement all administrative interfaces of the regular kernel, or such a system would have non-uniform configuration/monitoring across interfaces - in some cases, administrative interfaces will require a NIC/TOE-specific switch in the kernel (netlink helps here) - route changes on multi-homed hosts (or any similar kind of failover) are difficult if the state of TCP connections is tied to specific NICs (I've discussed some issues when "migrating" TCP connections in the documentation of tcpcp, http://www.almesberger.net/tcpcp/) - new kernel features will always lag behind on this kind of TOE, and different kernels will require different "firmware" - last but not least, keeping TOE firmware up to date with the TCP/IP stack in the mainstream kernel will require - for each such TOE device - a significant and continuous effort over a long period of time In short, I think such a solution is either a pain to use, or unmaintainable, or - most likely - both. So, how to do better ? Easy: use the Source, Luke. Here's my idea: - instead of putting a different stack on the TOE, a general-purpose processor (probably with some enhancements, and certainly with optimized data paths) is added to the NIC - that processor runs the same Linux kernel image as the host, acting like a NUMA system - a selectable part of TCP/IP is handled on the NIC, and the rest of the system runs on the host processor - instrumentation is added to the mainstream kernel to ensure that as little data as possible is shared between the main CPU and such peripheral CPUs. Note that such instrumentation would be generic, outlining possible boundaries, and not tied to a specific TOE design. - depending on hardware details (cache coherence, etc.), the instrumentation mentioned above may even be necessary for correctness. This would have the unfortunate effect of making the design very fragile with respect to changes in the mainstream kernel. (Performance loss in the case of imperfect instrumentation would be preferable.) - further instrumentation may be needed to let the kernel switch CPUs (i.e. host to NIC, and vice versa) at the right time - since the NIC would probably use a CPU design different from the host CPU, we'd need "fat" kernel binaries: - data structures are the same, i.e. word sizes, byte order, bit numbering, etc. are compatible, and alignments are chosen such that all CPUs involved are reasonably happy - kernels live in the same address space - function pointers become arrays, with one pointer per architecture. When comparing pointers, the first element is used. - if one should choose to also run parts of user space on the NIC, fat binaries would also be needed for this (along with other complications) Benefits: - putting the CPU next to the NIC keeps data paths short, and allows for all kinds of optimizations (e.g. a pipelined memory architecture) - the design is fairly generic, and would equally apply to other areas of the kernel than TCP/IP - using the same kernel image eliminates most maintenance problems, and encourages experimenting with the stack - using the same kernel image (and compatible data structures) guarantees that administrative interfaces are uniform in the entire system - such a design is likely to be able to allow TCP state to be moved to a different NIC, if necessary Possible problems, that may kill this idea: - it may be too hard to achieve correctness - it may be too hard to switch CPUs properly - it may not be possible to express copy operations efficiently in such a context - there may be no way to avoid sharing of hardware-specific data structures, such as page tables, or to emulate their use - people may consider the instrumentation required for this, although fairly generic, too intrusive - all this instrumentation may eat too much performance - nobody may be interested in building hardware for this - nobody may be patient enough to pursue such long-termish development, with uncertain outcome - something I haven't thought of I lack the resources (hardware, financial, and otherwise) to actually do something with these ideas, so please feel free to put them to some use. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From niv@us.ibm.com Sat Aug 2 10:32:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 10:33:04 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.133]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72HWkFl024975 for ; Sat, 2 Aug 2003 10:32:53 -0700 Received: from westrelay03.boulder.ibm.com (westrelay03.boulder.ibm.com [9.17.195.12]) by e35.co.us.ibm.com (8.12.9/8.12.2) with ESMTP id h72HWVc8270888; Sat, 2 Aug 2003 13:32:31 -0400 Received: from us.ibm.com (d03av03.boulder.ibm.com [9.17.193.83]) by westrelay03.boulder.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h72HWUYc053666; Sat, 2 Aug 2003 11:32:31 -0600 Message-ID: <3F2BF5C7.90400@us.ibm.com> Date: Sat, 02 Aug 2003 10:32:55 -0700 From: Nivedita Singhvi User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2.1) Gecko/20021130 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Werner Almesberger CC: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> In-Reply-To: <20030802140444.E5798@almesberger.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4456 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: niv@us.ibm.com Precedence: bulk X-list: netdev Werner Almesberger wrote: > (*) The InfiniBand people unfortunately call also their TCP/IP > bypass "TOE" (for which they promptly get shouted down, > every time they use that word). This is misleading, because Thank you! Yes! All in favor say Aye..AYE!!! Motion passes, the infiniband people don't get to call it TOE anymore.. > While I'm not entirely convinced about the usefulness of TOE in > all the cases it's been suggested for, I can see value in certain > areas, e.g. when TCP per-packet overhead becomes an issue. Ditto, but I see it being used to rollout the idea and process, rather than anything of value now, and the lessons are being learned for the future, when we reach 20Gb, 40Gb, even faster networks of tommorow. The processors might keep up, but nothing else will, for sure. > However, I consider the approach of putting a new or heavily > modified stack, which duplicates a considerable amount of the > functionality in the main kernel, on a separate piece of hardware > questionable at best. Some of the issues: > > - if this stack is closed source or generally hard to modify, > security fixes will be slowed down as will bug fixes, and debugging becomes a right royal pain. Also, most profiles of networking applications show the largest blip is essentially the user<->kernel transfer, and that would still remain the unaddressed bottleneck. > So, how to do better ? Easy: use the Source, Luke. Here's my > idea: > > - instead of putting a different stack on the TOE, a > general-purpose processor (probably with some enhancements, > and certainly with optimized data paths) is added to the NIC The thing is, all the TOE efforts are propietary ones, to my limited knowledge. Thus all the design is occurring in confidential, vendor internal forums. How will they/we come up with really the needed, _common_ design approach? Or is this not so needed? thanks, Nivedita From werner@almesberger.net Sat Aug 2 11:06:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 11:06:21 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72I6BFl027777 for ; Sat, 2 Aug 2003 11:06:11 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72I65G24280; Sat, 2 Aug 2003 11:06:06 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72I60t30481; Sat, 2 Aug 2003 15:06:00 -0300 Date: Sat, 2 Aug 2003 15:06:00 -0300 From: Werner Almesberger To: Nivedita Singhvi Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030802150600.F5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2BF5C7.90400@us.ibm.com>; from niv@us.ibm.com on Sat, Aug 02, 2003 at 10:32:55AM -0700 X-archive-position: 4457 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Nivedita Singhvi wrote: > Also, most profiles of networking applications show the > largest blip is essentially the user<->kernel transfer, and > that would still remain the unaddressed bottleneck. I have some hope that sendfile plus a NUMA-like approach will be sufficient for keeping transfers away from buses and memory they don't need to touch. > The thing is, all the TOE efforts are propietary ones, to > my limited knowledge. Many companies default to "closed" designs if they're not given a convincing reason for going "open". The approach I've described may provide that reason. There are also historicial reasons, e.g. if you want to interface with the stack of Windows, or any proprietary Unix, you probably need to obtain some of their source under NDA, and use some of that information in your own drivers or firmware. Of course, none of this is an issue here. Since we're talking about 1-2 years of development time anyway, legacy hardware (i.e. hardware choices influenced by information obtained under an NDA) will be quite obsolete by then and doesn't matter. > Or is this not so needed? Exactly. The "NUMA" approach would avoid the "common TOE design" problem. All you need is a reasonably well documented "general-purpose" CPU (that doesn't mean it has to be an off-the-shelf design, but most likely, the core would be an off-the-shelf one), plus some NIC hardware. Now, if that NIC in turn has some hidden secrets, this isn't an issue as long as one can still write a GPLed driver for it. Of course, there would be elements in such a system that vendors would like to keep secret. But then, there always are, and so far, we've found reasonable compromises most of the time, so I don't see why this couldn't happen here, too. Also, if "classical TOE" patches keep getting rejected, but an open and maintainable approach makes it into the mainstream kernel, also the business aspects should become fairly clear. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From jgarzik@pobox.com Sat Aug 2 12:09:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 12:09:18 -0700 (PDT) Received: from www.linux.org.uk (IDENT:yTEIorJOZS7ZcYhSCsBkoXNszRdAMY2B@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72J96Fl032752 for ; Sat, 2 Aug 2003 12:09:07 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j1kV-0003a8-LU; Sat, 02 Aug 2003 20:09:03 +0100 Message-ID: <3F2C0C44.6020002@pobox.com> Date: Sat, 02 Aug 2003 15:08:52 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Nivedita Singhvi CC: Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> In-Reply-To: <3F2BF5C7.90400@us.ibm.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4458 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev My own brain dump: If one wants to go straight from disk to network, why is anyone bothering to involve the host CPU and host memory bus at all? Memory bandwidth and PCI bus bandwidth are still bottlenecks, no much how much of the net stack you offload. Regardless of how fast your network zooms packets, you've gotta keep that pipeline full to make use of it. And you've gotta do something intelligent with it, which in TCP's case involves the host CPU quite a bit. TCP is sufficiently complex, for a reason. It has to handle all manner of disturbingly slow and disturbing fast net connections, all jabbering at the same time. TCP is a "one size fits all" solution, but it doesn't work well for everyone. The "TCP Offload Everything" people really need to look at what data your users want to push, at such high speeds. It's obviously not over a WAN... so steer users away from TCP, to an IP protocol that is tuned for your LAN needs, and more friendly to some sort of h/w offloading solution. A "foo over ipv6" protocol that was designed for h/w offloading from the start, would be a far better idea than full TCP offload will ever be. In any case, when you approach these high speeds, you really must take a good look at the other end of the pipeline: what are you serving at 10Gb/s, 20Gb/s, 40Gb/s? For some time, I think the answer will be "highly specialized stuff" At some point, Intel networking gear will be able to transfer more bits per second than there exist atoms on planet Earth :) Garbage in, garbage out. So, fix the other end of the pipeline too, otherwise this fast network stuff is flashly but pointless. If you want to serve up data from disk, then start creating PCI cards that have both Serial ATA and ethernet connectors on them :) Cut out the middleman of the host CPU and host memory bus instead of offloading portions of TCP that do not need to be offloaded. Jeff From alan@lxorguk.ukuu.org.uk Sat Aug 2 14:01:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 14:01:52 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72L1hFl024055 for ; Sat, 2 Aug 2003 14:01:44 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h72KvkC3020394; Sat, 2 Aug 2003 21:57:47 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h72KvjLd020392; Sat, 2 Aug 2003 21:57:45 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: TOE brain dump From: Alan Cox To: Werner Almesberger Cc: netdev@oss.sgi.com, Linux Kernel Mailing List In-Reply-To: <20030802140444.E5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 02 Aug 2003 21:57:44 +0100 X-archive-position: 4459 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Sad, 2003-08-02 at 18:04, Werner Almesberger wrote: > - last but not least, keeping TOE firmware up to date with the > TCP/IP stack in the mainstream kernel will require - for each > such TOE device - a significant and continuous effort over a > long period of time or even the protocol and protocol refinements.. > - instead of putting a different stack on the TOE, a > general-purpose processor (probably with some enhancements, > and certainly with optimized data paths) is added to the NIC Like say an opteron in the 2nd socket on the motherboard > Benefits: > > - putting the CPU next to the NIC keeps data paths short, and > allows for all kinds of optimizations (e.g. a pipelined > memory architecture) It moves the cost it doesnt make it vanish If I read you right you are arguing for a second processor running Linux.with its own independant memory bus. AMD make those already its called AMD64. I don't know anyone thinking at that level about partitioning one as an I/O processor. From werner@almesberger.net Sat Aug 2 14:49:18 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 14:49:28 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72LnGFl028183 for ; Sat, 2 Aug 2003 14:49:17 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72Ln8G25059; Sat, 2 Aug 2003 14:49:08 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72Ln1e31495; Sat, 2 Aug 2003 18:49:01 -0300 Date: Sat, 2 Aug 2003 18:49:01 -0300 From: Werner Almesberger To: Jeff Garzik Cc: Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030802184901.G5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2C0C44.6020002@pobox.com>; from jgarzik@pobox.com on Sat, Aug 02, 2003 at 03:08:52PM -0400 X-archive-position: 4460 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Jeff Garzik wrote: > jabbering at the same time. TCP is a "one size fits all" solution, but > it doesn't work well for everyone. But then, ten "optimized xxPs" that work well in two different scenarios each, but not so good in the 98 others, wouldn't be much fun either. It's been tried a number of times. Usually, real life sneaks in at one point or another, leaving behind a complex mess. When they've sorted out these problems, regular TCP has caught up with the great optimized transport protocols. At that point, they return to their niche, sometimes tail between legs and muttering curses, sometimes shaking their fist and boldly proclaiming how badly they'll rub TCP in the dirt in the next round. Maybe they shed off some of the complexity, and trade it for even more aggressive optimization, which puts them into their niche even more firmly. Eventually, they fade away. There are cases where TCP doesn't work well, like a path of badly mismatched link layers, but such paths don't treat any protocol following the end-to-end principle kindly. Another problem of TCP is that it has grown a bit too many knobs you need to turn before it works over your really fast really long pipe. (In one of the OLS after dinner speeches, this was quite appropriately called the "wizard gap".) > It's obviously not over a WAN... That's why NFS turned off UDP checksums ;-) As soon as you put it on IP, it will crawl to distances you didn't imagine in your wildest dreams. It always does. > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. That's a good point. A hierarchical memory structure can help here. Moving one end closer to the hardware, and letting it know (e.g. through sendfile) that also the other end is close (or can be reached more directly that through some hopelessly crowded main bus) may help too. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From werner@almesberger.net Sat Aug 2 15:14:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:14:22 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MEJFl030541 for ; Sat, 2 Aug 2003 15:14:19 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72MEGG25185; Sat, 2 Aug 2003 15:14:17 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72MEBR31594; Sat, 2 Aug 2003 19:14:11 -0300 Date: Sat, 2 Aug 2003 19:14:11 -0300 From: Werner Almesberger To: Alan Cox Cc: netdev@oss.sgi.com, Linux Kernel Mailing List Subject: Re: TOE brain dump Message-ID: <20030802191411.H5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk>; from alan@lxorguk.ukuu.org.uk on Sat, Aug 02, 2003 at 09:57:44PM +0100 X-archive-position: 4461 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Alan Cox wrote: > It moves the cost it doesnt make it vanish I don't think it really can. What it can do is reduce the overhead (which usually translates to latency and burstiness) and the sharing. > If I read you right you are arguing for a second processor running > Linux.with its own independant memory bus. AMD make those already its > called AMD64. I don't know anyone thinking at that level about > partitioning one as an I/O processor. That's taking this idea to an extreme, yes. I'd think of using something as big as an amd64 for this as "too expensive", but perhaps it's cheap enough in the long run, compared to some "optimized" design. It would certainly have the advantage of already solving various consistency and compatibility issues. (That is, if your host CPUs is/are also amd64.) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From willy@www.linux.org.uk Sat Aug 2 15:21:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:21:54 -0700 (PDT) Received: from www.linux.org.uk (IDENT:UcG6xcZ7ts6X+JIzSklM4po5qU39agf+@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MLlFl031389 for ; Sat, 2 Aug 2003 15:21:48 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19j4kz-0004pd-Mh; Sat, 02 Aug 2003 23:21:45 +0100 Date: Sat, 2 Aug 2003 23:21:45 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801162536.GA18574@gtf.org> User-Agent: Mutt/1.4.1i X-archive-position: 4462 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 12:25:36PM -0400, Jeff Garzik wrote: > On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > It's standard netdevice.h practice, and, he didn't disagree w/ my > rebuttal. OK, now that the two of you thrashed out a design, here's my implementation: diff -u drivers/net/8139too.c drivers/net/8139too.c --- drivers/net/8139too.c 31 Jul 2003 17:09:52 -0000 +++ drivers/net/8139too.c 2 Aug 2003 18:38:25 -0000 @@ -973,7 +973,7 @@ dev->do_ioctl = netdev_ioctl; dev->tx_timeout = rtl8139_tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; - dev->ethtool_ops = &rtl8139_ethtool_ops; + set_ethtool_ops(dev, &rtl8139_ethtool_ops); /* note: the hardware is not capable of sg/csum/highdma, however * through the use of skb_copy_and_csum_dev we enable these diff -u drivers/net/tg3.c drivers/net/tg3.c --- drivers/net/tg3.c 31 Jul 2003 11:12:10 -0000 +++ drivers/net/tg3.c 2 Aug 2003 18:37:54 -0000 @@ -6724,11 +6724,11 @@ dev->do_ioctl = tg3_ioctl; dev->tx_timeout = tg3_tx_timeout; dev->poll = tg3_poll; - dev->ethtool_ops = &tg3_ethtool_ops; dev->weight = 64; dev->watchdog_timeo = TG3_TX_TIMEOUT; dev->change_mtu = tg3_change_mtu; dev->irq = pdev->irq; + set_ethtool_ops(dev, &tg3_ethtool_ops); err = tg3_get_invariants(tp); if (err) { diff -u include/linux/netdevice.h include/linux/netdevice.h --- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 +++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 @@ -477,6 +477,10 @@ */ #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * ops) +{ + dev->ethtool_ops = ops; +} struct packet_type { Happy with that? > > > * I still do not see the need to change a simple storage of a constant > > > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg > > > dump len, eeprom dump len, nic-specific stats len, self-test len). > > > Internal kernel code that needs this information is always a slow path > > > anyway, so just call the ->get_drvinfo hook internally. > > > > slow path, sure, but increased stack usage. it's a tradeoff, and this way > > feels more clean to me. > > Additing a function hook each time you want to retrieve a new integer > value? That's feels overly excessive to me. Actually, it's a useful thing to do because it specifies what kind of answer we want. For example, up here, you called them all foo_len. That's not true. Some of them are a byte-count (== len), but some of them are a count of N-byte quantities. That's an unfortunate bit of design, but at least we can make it obvious to the driver-writer what we're expecting of them. > > > * I prefer not to add '#include ' to ethtool.h > > > > That means that any code which includes ethtool.h has to include types.h > > first (either implicitly or explicitly). The rule so far has been that > > header files should call out their dependencies explictly with an include > > of the appropriate file. So why *don't* you want it? > > Because I copy it to userspace :) linux/types.h exists in userspace ;-) You even _expect_ userspce to have already included it -- or where else are the `u32' quantities defined? -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From jgarzik@pobox.com Sat Aug 2 15:35:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:35:04 -0700 (PDT) Received: from www.linux.org.uk (IDENT:VMLs27LF89sjE5OoM6LvqsoQ8uGmYllr@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MYwFl032658 for ; Sat, 2 Aug 2003 15:34:59 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j4xl-0004tx-Rf; Sat, 02 Aug 2003 23:34:57 +0100 Message-ID: <3F2C3C86.6000202@pobox.com> Date: Sat, 02 Aug 2003 18:34:46 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4463 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Fri, Aug 01, 2003 at 12:25:36PM -0400, Jeff Garzik wrote: > >>On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: >> >>>On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: >>> >>>>* need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar >> >>It's standard netdevice.h practice, and, he didn't disagree w/ my >>rebuttal. > > > OK, now that the two of you thrashed out a design, here's my implementation: > > diff -u drivers/net/8139too.c drivers/net/8139too.c > --- drivers/net/8139too.c 31 Jul 2003 17:09:52 -0000 > +++ drivers/net/8139too.c 2 Aug 2003 18:38:25 -0000 > @@ -973,7 +973,7 @@ > dev->do_ioctl = netdev_ioctl; > dev->tx_timeout = rtl8139_tx_timeout; > dev->watchdog_timeo = TX_TIMEOUT; > - dev->ethtool_ops = &rtl8139_ethtool_ops; > + set_ethtool_ops(dev, &rtl8139_ethtool_ops); > > /* note: the hardware is not capable of sg/csum/highdma, however > * through the use of skb_copy_and_csum_dev we enable these > diff -u drivers/net/tg3.c drivers/net/tg3.c > --- drivers/net/tg3.c 31 Jul 2003 11:12:10 -0000 > +++ drivers/net/tg3.c 2 Aug 2003 18:37:54 -0000 > @@ -6724,11 +6724,11 @@ > dev->do_ioctl = tg3_ioctl; > dev->tx_timeout = tg3_tx_timeout; > dev->poll = tg3_poll; > - dev->ethtool_ops = &tg3_ethtool_ops; > dev->weight = 64; > dev->watchdog_timeo = TG3_TX_TIMEOUT; > dev->change_mtu = tg3_change_mtu; > dev->irq = pdev->irq; > + set_ethtool_ops(dev, &tg3_ethtool_ops); > > err = tg3_get_invariants(tp); > if (err) { > diff -u include/linux/netdevice.h include/linux/netdevice.h > --- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 > +++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 > @@ -477,6 +477,10 @@ > */ > #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) > > +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * > ops) > +{ > + dev->ethtool_ops = ops; > +} It needs to be a macro for maximum flexibility. Also, no need to convert in-kernel drivers over to using it... Let driver authors use it or not as they choose. Jeff From davem@redhat.com Sat Aug 2 17:32:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 17:32:22 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h730WBFl009320 for ; Sat, 2 Aug 2003 17:32:14 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id RAA10269; Sat, 2 Aug 2003 17:28:07 -0700 Date: Sat, 2 Aug 2003 17:28:07 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030802172807.3d56b4ea.davem@redhat.com> In-Reply-To: <3F2C3C86.6000202@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4464 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sat, 02 Aug 2003 18:34:46 -0400 Jeff Garzik wrote: > Matthew Wilcox wrote: > > +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * > > ops) > > +{ > > + dev->ethtool_ops = ops; > > +} > > > It needs to be a macro for maximum flexibility. Yes, and please name it with capitol letters, ie. SET_ETHTOOL_OPS(), I have no idea why you used lower-case letters when Jeff and I referred to it consistently with caps. :-) From davem@redhat.com Sat Aug 2 18:37:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 18:37:35 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h731bUFl014661 for ; Sat, 2 Aug 2003 18:37:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id SAA10442; Sat, 2 Aug 2003 18:33:10 -0700 Date: Sat, 2 Aug 2003 18:33:10 -0700 From: "David S. Miller" To: chas3@users.sourceforge.net Cc: chas@cmf.nrl.navy.mil, mitch@sfgoth.com, netdev@oss.sgi.com Subject: Re: [Linux-ATM-General] Re: [atmdrvr zatm] Remove obsolete EXACT_TS support Message-Id: <20030802183310.05e2cbbc.davem@redhat.com> In-Reply-To: <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil> References: <20030730225741.GA57991@gaz.sfgoth.com> <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4465 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 31 Jul 2003 10:23:58 -0400 chas williams wrote: > please apply to 2.6. zatm will now compile on smp. it might > actually work if someone had some hardware to test it. Applied. From jgarzik@pobox.com Sat Aug 2 20:14:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 20:14:49 -0700 (PDT) Received: from www.linux.org.uk (IDENT:KG4N4yt9e4roVKuUrU9+s/Z/FMeFYkHk@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h733EdFl022330 for ; Sat, 2 Aug 2003 20:14:40 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j9KQ-0006uX-5l; Sun, 03 Aug 2003 04:14:38 +0100 Message-ID: <3F2C7E12.8070904@pobox.com> Date: Sat, 02 Aug 2003 23:14:26 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4466 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Sat, Aug 02, 2003 at 06:34:46PM -0400, Jeff Garzik wrote: > >>>diff -u include/linux/netdevice.h include/linux/netdevice.h >>>--- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 >>>+++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 >>>@@ -477,6 +477,10 @@ >>> */ >>>#define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) >>> >>>+static inline void set_ethtool_ops(struct net_device *dev, struct >>>ethtool_ops * >>>ops) >>>+{ >>>+ dev->ethtool_ops = ops; >>>+} >> >> >>It needs to be a macro for maximum flexibility. > > > Nothing stops it being implemented as a macro in kcompat. Having it as > an inline function gives it argument typechecking which always gives me > the warm fuzzies. No, it _needs_ to be a macro for maximum flexibility. Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a trigger, to signal that compat code is needed. No need for drivers to create tons of kernel-version-code ifdefs, just to test for when ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor backports, and (possibly) 2.4 itself. Also, doing it at the cpp level allows compat code to #undef it, if it _really_ knows what its doing, and the situation calls for it. >>Also, no need to convert in-kernel drivers over to using it... Let >>driver authors use it or not as they choose. > > > I took "Like pci_set_drvdata" as the most important part of your > argument... having everyone use it is no bad thing. Certainly. I have no real preferences either way, just noting that in-kernel drivers don't _need_ to use this macro. Jeff From greearb@candelatech.com Sat Aug 2 20:48:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 20:49:08 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h733mwFl025313 for ; Sat, 2 Aug 2003 20:48:58 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h733mptf003818; Sat, 2 Aug 2003 20:48:52 -0700 Message-ID: <3F2C8623.2080106@candelatech.com> Date: Sat, 02 Aug 2003 20:48:51 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Sascha Schumann CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4467 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Sascha Schumann wrote: >>Kernel is 2.4.21 + custom patches (which should not affect tg3). >> >>lspci says the NIC is: Altima AC9100 (rev 15) > > > [1] says that the AC9100 based Netgear GA302T cards don't > support jumbo frames. I'm seeing regular lockups once > packets larger than 1500bytes flow through the NIC. > > It would be cool though if this turned out to be a driver > limitation and not a (crippled) chipset issue. It definately handles 4000 byte frames just fine, you just need to ifdown and ifup it after changing the MTU much of the time...or maybe only when running it under heavy load when you make the MTU change... Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb@candelatech.com Sat Aug 2 21:01:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 21:01:54 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7341iFl026650 for ; Sat, 2 Aug 2003 21:01:44 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h7341Vtf005412; Sat, 2 Aug 2003 21:01:31 -0700 Message-ID: <3F2C891B.7080004@candelatech.com> Date: Sat, 02 Aug 2003 21:01:31 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> In-Reply-To: <3F2C0C44.6020002@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4468 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host I for one would love to see something like this, and not just Serial ATA.. but maybe 8x Serial ATA and RAID :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From scott.feldman@intel.com Sat Aug 2 21:34:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 21:34:58 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h734YmFl028503 for ; Sat, 2 Aug 2003 21:34:49 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h734Sid08681 for ; Sun, 3 Aug 2003 04:28:44 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h733wGY02567 for ; Sun, 3 Aug 2003 03:58:17 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080221344226801 for ; Sat, 02 Aug 2003 21:34:42 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Sat, 2 Aug 2003 21:34:42 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: e100 "Ferguson" release Date: Sat, 2 Aug 2003 21:34:42 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNZeI636C/uaYjsSwqQ/jrIhuMDyw== From: "Feldman, Scott" To: X-OriginalArrivalTime: 03 Aug 2003 04:34:42.0802 (UTC) FILETIME=[8F15BD20:01C35978] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h734YmFl028503 X-archive-position: 4469 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev New development version: http://sf.net/projects/e1000, e100-3.0.0_dev11.tar.gz Many thanks to JC [jchapman@katalix.com] for exploring the small packet performance w/ and w/o NAPI. This version includes one of his optimization; others may follow, but I wanted to get this goodness out now. * added opportunistic fast loop (no udelays) in e100_exec_cmd to wait for previous cmd to be accepted before queuing next cmd. Boost small packet performance. [jchapman@katalix.com]. * Use correct versions of dev_kfree_skb for depending on possible contexts. [jchapman@katalix.com]. * Added SET_NETDEV_DEV(). Looking for more testing on non-IA archs, power management, cardbus nics, and WoL. -scott From willy@www.linux.org.uk Sat Aug 2 22:01:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 22:01:14 -0700 (PDT) Received: from www.linux.org.uk (IDENT:xUszKN2jPgzXzfcq0ScA0iCxK31fnXwn@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73516Fl030805 for ; Sat, 2 Aug 2003 22:01:08 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19j6iu-0005dM-4h; Sun, 03 Aug 2003 01:27:44 +0100 Date: Sun, 3 Aug 2003 01:27:44 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2C3C86.6000202@pobox.com> User-Agent: Mutt/1.4.1i X-archive-position: 4470 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Sat, Aug 02, 2003 at 06:34:46PM -0400, Jeff Garzik wrote: > >diff -u include/linux/netdevice.h include/linux/netdevice.h > >--- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 > >+++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 > >@@ -477,6 +477,10 @@ > > */ > > #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) > > > >+static inline void set_ethtool_ops(struct net_device *dev, struct > >ethtool_ops * > >ops) > >+{ > >+ dev->ethtool_ops = ops; > >+} > > > It needs to be a macro for maximum flexibility. Nothing stops it being implemented as a macro in kcompat. Having it as an inline function gives it argument typechecking which always gives me the warm fuzzies. > Also, no need to convert in-kernel drivers over to using it... Let > driver authors use it or not as they choose. I took "Like pci_set_drvdata" as the most important part of your argument... having everyone use it is no bad thing. -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From jsanchez@cs.ucf.edu Sat Aug 2 22:50:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 22:51:04 -0700 (PDT) Received: from longwood.cs.ucf.edu (longwood.cs.ucf.edu [132.170.108.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h735ouFl002605 for ; Sat, 2 Aug 2003 22:50:57 -0700 Received: from mobile (eola [132.170.108.2]) by longwood.cs.ucf.edu (8.12.2/8.12.2) with ESMTP id h735oqB4001424 for ; Sun, 3 Aug 2003 01:50:52 -0400 (EDT) Subject: Re: [Bug 1030] New: racoon causes oops when implementing IPSec key From: Justin Sanchez To: netdev@oss.sgi.com In-Reply-To: <20030802212018.B14141@electric-eye.fr.zoreil.com> References: <89550000.1059833972@[10.10.2.4]> <20030802163333.A12217@electric-eye.fr.zoreil.com> <1059850039.1187.2.camel@mobile> <20030802212018.B14141@electric-eye.fr.zoreil.com> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-2Y8LE63gCCNUG9OvqJTX" X-Mailer: Ximian Evolution 1.0.8 Date: 03 Aug 2003 01:51:22 -0400 Message-Id: <1059889883.1187.15.camel@mobile> Mime-Version: 1.0 X-archive-position: 4471 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jsanchez@cs.ucf.edu Precedence: bulk X-list: netdev --=-2Y8LE63gCCNUG9OvqJTX Content-Type: text/plain Content-Transfer-Encoding: quoted-printable Hi. I had this problem on 2.6.0-test1 and -test2 and -test2-bk2, so I'll try to report it. I'm new to the scene, so I apologize in advance for this post. Background. 2 machines. e100 cards on each, if it matters. ipsec-tools 0.2.2. I give each of them directives to use esp and ah in transport mode. I turn on racoon on each box. I ping. Both panic. Following is about the message:. Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: c02bbd06 *pde =3D 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010206 EIP is at memcpy+0x1e/0x39 eax: 00000018 ebx: f6fe8a00 ecx: 00000006 edx: 00000000 esi: 00000000 edi: 00000000 ebp: c0562520 esp: f6fb5ccc ds: 007b es: 007b ss:0068 Process racoon (pid: 418, threadinfo=3Df6fb4000 task=3Df6fbb300) Stack: Call Trace: xfrm_state_update pfkey_add parse_exthdrs pfkey_process pfkey_sendmsg sock_sendmsg verify_iovec sys_sendmsg sockfd_lookup sys_sendto sys_getsockname __pollwait update_process sys_send sys_socketcall syscall_call Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 89 d0 8b 74 24 02 8b <0>Kernel panic: Fatal exception in interrupt In interrupt handler =3D not syncing If you want the full text of it, its at 67.9.9.32/oops.jpg. I'm probably just doing something stupid... On Sat, 2003-08-02 at 15:20, Francois Romieu wrote: > Justin Sanchez : > [...] > > How current? I've just seen it in -test2-bk2. >=20 > Forwarded to davem@redhat.com. >=20 > You may consider posting the data of the bug-report updated to -test2-bk2 > on netdev@oss.sgi.com. >=20 > -- > Ueimor >=20 --=-2Y8LE63gCCNUG9OvqJTX Content-Type: application/pgp-signature; name=signature.asc Content-Description: This is a digitally signed message part -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQA/LKLaiLmb/rWLQdQRAmWuAJ4g7wXLF1O+gFi+jrLeThezwWAsywCgkiao YUjA6YtWFR9yOVO/5JnRKZc= =6zEM -----END PGP SIGNATURE----- --=-2Y8LE63gCCNUG9OvqJTX-- From davem@redhat.com Sat Aug 2 23:00:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:00:32 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7360SFl003649 for ; Sat, 2 Aug 2003 23:00:29 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA10944; Sat, 2 Aug 2003 22:56:20 -0700 Date: Sat, 2 Aug 2003 22:56:19 -0700 From: "David S. Miller" To: Daniel Ritz Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH 2.6] Fix IPv6 esp mem leak in esp6_input Message-Id: <20030802225619.17d477e3.davem@redhat.com> In-Reply-To: <200308021350.23342.daniel.ritz@gmx.ch> References: <200308021350.23342.daniel.ritz@gmx.ch> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4472 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sat, 2 Aug 2003 13:50:23 +0200 Daniel Ritz wrote: > fixes a mem leak in esp6_input() in the error paths. and return -ENOMEM, > not -EINVAL when out of memory. against 2.6.0-test2-bk Patch applied, thanks Daniel. From davem@redhat.com Sat Aug 2 23:05:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:05:07 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73652Fl004305 for ; Sat, 2 Aug 2003 23:05:03 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA10978; Sat, 2 Aug 2003 22:59:48 -0700 Date: Sat, 2 Aug 2003 22:59:48 -0700 From: "David S. Miller" To: Ville Nuorvala Cc: yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: Incorrect hoplimit in ip6_push_pending_frames() Message-Id: <20030802225948.01c96fb7.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4473 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 1 Aug 2003 14:15:21 +0300 (EEST) Ville Nuorvala wrote: > I noticed the hop limit passed to ip6_append_data() isn't used by > ip6_push_pending_frames(), which might lead to unexpected behavior with > multicast and (ipv6-in-ipv6) tunneled packets. This patch (against Linux > 2.6.0-test2 and cset 1.1595) fixes the problem. Applied, thank you. From jgarzik@pobox.com Sat Aug 2 23:13:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:13:06 -0700 (PDT) Received: from www.linux.org.uk (IDENT:brcXRjmAJ7+L5OReqrzNa8QAiyIV4/9i@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736D1Fl005185 for ; Sat, 2 Aug 2003 23:13:02 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jC72-0002bL-FW; Sun, 03 Aug 2003 07:13:00 +0100 Message-ID: <3F2CA7E1.6060800@pobox.com> Date: Sun, 03 Aug 2003 02:12:49 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> In-Reply-To: <3F2CA65F.8060105@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4474 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > * (extremely minor) some people (like me :)) consider dead reads like > the readb() call in e100_write_flush er, that was a bit incomplete. completing: ... needing to be marked explicitly with a "(void) " prefix, indicating it is intentionally a dead read. Maintainer's call, ultimately, though... From alan@storlinksemi.com Sat Aug 2 23:23:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:23:12 -0700 (PDT) Received: from smtp013.mail.yahoo.com (smtp013.mail.yahoo.com [216.136.173.57]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736N7Fl006172 for ; Sat, 2 Aug 2003 23:23:08 -0700 Received: from cpe-66-1-155-95.ca.sprintbbd.net (HELO AlanLap) (alansuntzishih@66.1.155.95 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 3 Aug 2003 06:23:06 -0000 From: "Alan Shih" To: "Ben Greear" , "Jeff Garzik" Cc: "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Sat, 2 Aug 2003 23:22:52 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 In-Reply-To: <3F2C891B.7080004@candelatech.com> X-archive-position: 4475 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... need a lot of bufferring/FIFO though. May require large modification to the file serving applications? Alan -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear Sent: Saturday, August 02, 2003 9:02 PM To: Jeff Garzik Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host I for one would love to see something like this, and not just Serial ATA.. but maybe 8x Serial ATA and RAID :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From jgarzik@pobox.com Sat Aug 2 23:40:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:41:00 -0700 (PDT) Received: from www.linux.org.uk (IDENT:NELjEc2FssOMkU3BYyvJ/bjDS2/IyxUi@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736ejFl007715 for ; Sat, 2 Aug 2003 23:40:46 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jCXs-0002jd-97; Sun, 03 Aug 2003 07:40:44 +0100 Message-ID: <3F2CAE61.7070401@pobox.com> Date: Sun, 03 Aug 2003 02:40:33 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org CC: Werner Almesberger , Nivedita Singhvi Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> In-Reply-To: <20030802184901.G5798@almesberger.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4476 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Werner Almesberger wrote: > Jeff Garzik wrote: > >>jabbering at the same time. TCP is a "one size fits all" solution, but >>it doesn't work well for everyone. > > > But then, ten "optimized xxPs" that work well in two different > scenarios each, but not so good in the 98 others, wouldn't be > much fun either. > > It's been tried a number of times. Usually, real life sneaks > in at one point or another, leaving behind a complex mess. > When they've sorted out these problems, regular TCP has caught > up with the great optimized transport protocols. At that point, > they return to their niche, sometimes tail between legs and > muttering curses, sometimes shaking their fist and boldly > proclaiming how badly they'll rub TCP in the dirt in the next > round. Maybe they shed off some of the complexity, and trade it > for even more aggressive optimization, which puts them into > their niche even more firmly. Eventually, they fade away. > > There are cases where TCP doesn't work well, like a path of > badly mismatched link layers, but such paths don't treat any > protocol following the end-to-end principle kindly. > > Another problem of TCP is that it has grown a bit too many > knobs you need to turn before it works over your really fast > really long pipe. (In one of the OLS after dinner speeches, > this was quite appropriately called the "wizard gap".) > > >>It's obviously not over a WAN... > > > That's why NFS turned off UDP checksums ;-) As soon as you put > it on IP, it will crawl to distances you didn't imagine in your > wildest dreams. It always does. Really fast, really long pipes in practice don't exist for 99.9% of all Internet users. When you approach traffic levels that push you want to offload most of the TCP net stack, then TCP isn't the right solution for you anymore, all things considered. The Linux net stack just isn't built to be offloaded. TOE engines will either need to (1) fall back to Linux software for all-but-the-common case (otherwise netfilter, etc. break), or, (2) will need to be hideously complex beasts themselves. And I can't see ASIC and firmware designers being excited about implementing netfilter on a PCI card :) Unfortunately some vendors seem to choosing TOE option #3: TCP offload which introduces many limitations (connection limits, netfilter not supported, etc.) which Linux never had before. Vendors don't seem to realize TOE has real potential to damage the "good network neighbor" image the net stack has. The Linux net stack's behavior is known, documented, predictable. TOE changes all that. There is one interesting TOE solution, that I have yet to see created: run Linux on an embedded processor, on the NIC. This stripped-down Linux kernel would perform all the header parsing, checksumming, etc. into the NIC's local RAM. The Linux OS driver interface becomes a virtual interface with a large MTU, that communicates from host CPU to NIC across the PCI bus using jumbo-ethernet-like data frames. Management frames would control the ethernet interface on the other side of the PCI bus "tunnel". >>So, fix the other end of the pipeline too, otherwise this fast network >>stuff is flashly but pointless. If you want to serve up data from disk, >>then start creating PCI cards that have both Serial ATA and ethernet >>connectors on them :) Cut out the middleman of the host CPU and host >>memory bus instead of offloading portions of TCP that do not need to be >>offloaded. > > > That's a good point. A hierarchical memory structure can help > here. Moving one end closer to the hardware, and letting it > know (e.g. through sendfile) that also the other end is close > (or can be reached more directly that through some hopelessly > crowded main bus) may help too. Definitely. Jeff From jgarzik@pobox.com Sat Aug 2 23:41:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:41:53 -0700 (PDT) Received: from www.linux.org.uk (IDENT:r2QFDDBea5MOAjl2dLgz0iK1HyGBBbZ4@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736fmFl008008 for ; Sat, 2 Aug 2003 23:41:49 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jCYr-0002jy-5D; Sun, 03 Aug 2003 07:41:45 +0100 Message-ID: <3F2CAE9D.5090401@pobox.com> Date: Sun, 03 Aug 2003 02:41:33 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Alan Shih CC: Ben Greear , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4477 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Alan Shih wrote: > A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... > need a lot of bufferring/FIFO though. May require large modification to the > file serving applications? Nope, that's using the existing sendfile(2) facility. Jeff From jgarzik@pobox.com Sun Aug 3 00:00:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:00:36 -0700 (PDT) Received: from www.linux.org.uk (IDENT:f0VnXDXUgPOR/pJVSNM/K9f0DO93mSKY@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7370SFl009721 for ; Sun, 3 Aug 2003 00:00:29 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jC0o-0002Yz-G8; Sun, 03 Aug 2003 07:06:34 +0100 Message-ID: <3F2CA65F.8060105@pobox.com> Date: Sun, 03 Aug 2003 02:06:23 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4478 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Comments: * Given that e100 is only 10/100 hardware, I like the decision to not support rx/tx checksumming and zero-copy. Particularly with some e100's, this eliminates various worries related to chip errata. And as with any "do it in software" solution, you guarantee that the chip never screws up and "acks" a checksum incorrectly, thus passing corrupted data up into the net stack. * (API) Does the out-of-tx-resources condition in e100_xmit_frame ever really happen? I am under the impression that returning non-zero in ->hard_start_xmit results in the packet sometimes being requeued and sometimes dropped. I prefer to guarantee a more-steady state, by simply dropping the packet unconditionally, when this uncommon condition occurs. So, I would a) mark the failure condition with unlikely(), and b) if the condition occurs, simply drop the packet (tx_dropped++, kfree skb), and return zero. Though, ultimately, I wish the net stack would support some way to _guarantee_ that the skb is requeued for transmit. Some packet schedulers in the kernel will drop the skb even if the ->hard_start_xmit return code indicates "requeue". This makes sense from the rule of "skbs are lossy, and can be dropped"... but it really sucks on hardware where unexpected -- but temporary -- loss of TX resources occurs. One can prevent 20-50% (or more) packet loss on certain classes of connections, simply by being able to tell the net stack "hey, if I could go back in time and issue a netif_stop_queue, before you called ->hard_start_xmit, I would" :) * (minor) for completeness, you should limit the PCI class in the pci_device_id table to PCI_CLASS_NETWORK_ETHERNET. There are one-in-a-million cases where this matters, but it's usually a BIOS bug. Still, it's there in pci_device_id table, and it's an easy change, so might as well use it. This is a good janitor task for other PCI net drivers, too. * (long term) I really like Ben H.'s work in drivers/net/sungem_phy.[ch] -- and similar benh code in ibm_emac -- and want to make his code generic for most MII phys. Just something to read and keep in mind. * (style) your struct config definition is terribly clever. perhaps too clever, making it unreadable? Not a specific complaint, mind you, just something that caught my eye. * (minor) in tg3, my own benchmarks and experiments showed it helped to explictly use ____cacheline_aligned markers when defining certain sections of members in struct tg3 (or struct nic, in e100's case). You already clearly pay attention to member layout WRT cache effects, but if you have a clear dividing line, or lines, in struct nic you can use _____cacheline_aligned for even greater benefit. At a minimum test it with a cpu-usage-measuring benchmark like ttcp, though, of course :) IIRC I divided tg3's struct into rx, tx, and "other" sections. * (extremely minor) some people (like me :)) consider dead reads like the readb() call in e100_write_flush * (major?) Aren't there some clunky e100 adapters that don't do MMIO? Do we care? * I would love to see feedback from people testing this driver on ppc64 and sparc64, particularly. * (style, minor) My eyes would prefer functions like e100_hw_reset to have a few more blank lines in them, spreading code+comment blocks out a bit. * (extremely minor) one wonders if you really need the write flush in mdio_ctrl. If the flush is removed, the same net effect appears to occur. * (boring but needed) convert all the magic numbers in e100_configure into constants, or at least add comments describing the magic numbers. If the value is just one bit, you might simply append "/* true */", for example. The general idea is to make the "member name = value" list a little bit more readable to somebody who doesn't know the hardware, and struct config, intimately. * IIRC Donald's MII phy scanning code scans MII phy ids like this: 1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In general I would prefer to follow his eepro100.c probe order. Some phys need this because they will report on both phy id #0 (which is magical) and phy id #(non-zero). Donald would know more than me, here. * I like the e100_exec_cb stuff, with the callbacks. * Is it easy to support MII phy interrupts? It would be nice to get a callback that was handled immediately, on phys that do support such interrupts. * do we care about spinlocks around the update_stats and get_stats code? * (bugs) in e100_up, you should undo mod_timer [major] and netif_start_queue [minor], if request_irq fails. And maybe stop the receiver, too? * for all constants 0xffffffff (and others as well if you so choose), prefer the C99 suffix to a cast. This is particularly relevant in pci_set_dma_mask calls, where one should be using 0xffffffffULL, but applies to other constants as well. * (potential races) in e100_probe, you want to call register_netdev as basically the last operation that can fail, if possible. Particularly, you need to move the PCI API operations above register_netdev. Remember, register_netdev winds up calling /sbin/hotplug, which in turn calls programs that will want to start using the interface. So you need to have everything set up by that point, really. * in e100_probe, "if(nic->csr == 0UL) {" should really just test for NULL, because ioremap is defined to return a pointer... * (minor) use a netif_msg_xxx wrapper/constant in e100_init_module test? From greearb@candelatech.com Sun Aug 3 00:32:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:32:19 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h737W8Fl012468 for ; Sun, 3 Aug 2003 00:32:09 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h737W1tf031554; Sun, 3 Aug 2003 00:32:01 -0700 Message-ID: <3F2CBA71.2070503@candelatech.com> Date: Sun, 03 Aug 2003 00:32:01 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: "Feldman, Scott" , netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> In-Reply-To: <3F2CA65F.8060105@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4479 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > Comments: > * (API) Does the out-of-tx-resources condition in e100_xmit_frame ever > really happen? I am under the impression that returning non-zero in > ->hard_start_xmit results in the packet sometimes being requeued and > sometimes dropped. I prefer to guarantee a more-steady state, by simply > dropping the packet unconditionally, when this uncommon condition > occurs. So, I would > a) mark the failure condition with unlikely(), and > b) if the condition occurs, simply drop the packet (tx_dropped++, kfree > skb), and return zero. > > Though, ultimately, I wish the net stack would support some way to > _guarantee_ that the skb is requeued for transmit. Some packet > schedulers in the kernel will drop the skb even if the ->hard_start_xmit > return code indicates "requeue". This makes sense from the rule of > "skbs are lossy, and can be dropped"... but it really sucks on hardware > where unexpected -- but temporary -- loss of TX resources occurs. One > can prevent 20-50% (or more) packet loss on certain classes of > connections, simply by being able to tell the net stack "hey, if I could > go back in time and issue a netif_stop_queue, before you called > ->hard_start_xmit, I would" :) Although I have not tried this latest patch, the existing e100 and e1000 in 2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), even when the next hard_start_xmit() call fails. For instance, this is the code I use in pktgen.c: if (!netif_queue_stopped(odev)) { if (odev->hard_start_xmit(next->skb, odev)) { if (net_ratelimit()) { printk(KERN_INFO "Hard xmit error\n"); } next->errors++; next->last_ok = 0; queue_stopped++; } else { queue_stopped = 0; next->last_ok = 1; next->sofar++; next->tx_bytes += (next->cur_pkt_size + 4); /* count csum */ } With e100 and e1000, I see the very large numbers of the hard_start_xmit failure when running very high packets-per-second rates (small packets). I see virtually no failures with tulip. pktgen knows how to re-queue, but it's curious it has to so often. For code that does not requeue, this could be even more of a bummer. To point b), I think if the driver accepts the packet in hard_start_xmit, it should be able to send the packet out, otherwise return the 'requeue' value and let the calling code know. It is very important to me, at least, to know if a packet has really been sent or not. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From davem@redhat.com Sun Aug 3 00:36:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:37:01 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h737aqFl013180 for ; Sun, 3 Aug 2003 00:36:53 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA11387; Sun, 3 Aug 2003 00:32:39 -0700 Date: Sun, 3 Aug 2003 00:32:39 -0700 From: "David S. Miller" To: Ben Greear Cc: jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803003239.4257ef24.davem@redhat.com> In-Reply-To: <3F2CBA71.2070503@candelatech.com> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4480 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev > Although I have not tried this latest patch, the existing e100 and e1000 in > 2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), > even when the next hard_start_xmit() call fails. Returning an error from hard_start_xmit() from normal ethernet drivers is, frankly, a driver bug and should never happen. I don't know if there is somehow something special about the e100, but even if there is hard_start_xmit() failures can be avoided by properly doing netif_queue_{stop,wakeup}() in the right places. From david.lang@digitalinsight.com Sun Aug 3 01:27:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 01:27:50 -0700 (PDT) Received: from warden.diginsite.com (warden-p.diginsite.com [208.29.163.248]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h738RcFl017767 for ; Sun, 3 Aug 2003 01:27:39 -0700 Received: from wlvims01.diginsite.com by warden.diginsite.com via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 01:27:38 -0700 Received: from calexc01.digitalinsight.com ([10.200.0.20]) by wlvims01.digitalinsight.com (Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35) with ESMTP id com; Sun, 3 Aug 2003 01:26:48 -0700 Received: by calexc01.diginsite.com with Internet Mail Service (5.5.2653.19) id ; Sun, 3 Aug 2003 01:27:31 -0700 Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) id P0FXW757; Sun, 3 Aug 2003 01:27:21 -0700 From: David Lang To: Alan Shih Cc: Ben Greear , Jeff Garzik , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Date: Sun, 3 Aug 2003 01:25:48 -0700 (PDT) Subject: RE: TOE brain dump In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4481 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david.lang@digitalinsight.com Precedence: bulk X-list: netdev do you really want the processor on the card to be tunning apache/NFS/Samba/etc ? putting enough linux on the card to act as a router (which would include the netfilter stuff) is one thing. putting the userspace code that interfaces with the outside world for file transfers is something else. if you really want the disk connected to your network card you are just talking a low-end linux box. forget all this stuff about it being on a card and just use a full box (economys of scale will make this cheaper) making a firewall that's a core system with a dozen slave systems attached to it (the network cards) sounds like the type of clustering that Linux has been used for for compute nodes. complicated to setup, but extremely powerful and scalable once configured. if you want more then a router on the card then Alan Cox is right, just add another processor to the system, it's easier and cheaper. David Lang On Sat, 2 Aug 2003, Alan Shih wrote: > Date: Sat, 2 Aug 2003 23:22:52 -0700 > From: Alan Shih > To: Ben Greear , Jeff Garzik > Cc: Nivedita Singhvi , > Werner Almesberger , netdev@oss.sgi.com, > linux-kernel@vger.kernel.org > Subject: RE: TOE brain dump > > A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... > need a lot of bufferring/FIFO though. May require large modification to the > file serving applications? > > Alan > > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org > [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear > Sent: Saturday, August 02, 2003 9:02 PM > To: Jeff Garzik > Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; > linux-kernel@vger.kernel.org > Subject: Re: TOE brain dump > > > Jeff Garzik wrote: > > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. > but maybe 8x Serial ATA and RAID :) > > Ben > > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From willy@www.linux.org.uk Sun Aug 3 07:57:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 07:57:05 -0700 (PDT) Received: from www.linux.org.uk (IDENT:dxP8jV70pveQkX3/KA355Yf9YCtbJJSj@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73EuwFl025115 for ; Sun, 3 Aug 2003 07:57:00 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19jKI4-0006o5-8W; Sun, 03 Aug 2003 15:56:56 +0100 Date: Sun, 3 Aug 2003 15:56:56 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> <3F2C7E12.8070904@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2C7E12.8070904@pobox.com> User-Agent: Mutt/1.4.1i X-archive-position: 4482 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Sat, Aug 02, 2003 at 11:14:26PM -0400, Jeff Garzik wrote: > Matthew Wilcox wrote: > >Nothing stops it being implemented as a macro in kcompat. Having it as > >an inline function gives it argument typechecking which always gives me > >the warm fuzzies. > > No, it _needs_ to be a macro for maximum flexibility. > > Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a > trigger, to signal that compat code is needed. No need for drivers to > create tons of kernel-version-code ifdefs, just to test for when > ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor > backports, and (possibly) 2.4 itself. Also, doing it at the cpp level > allows compat code to #undef it, if it _really_ knows what its doing, > and the situation calls for it. OK. At this point, I really feel like I'm getting in the way and hindering more than I'm helping. Can I pass the torch to you and let you finish the job? -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From wsx@6com.sk Sun Aug 3 08:44:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 08:44:40 -0700 (PDT) Received: from mail.6com.sk (cement.ksp.edi.fmph.uniba.sk [158.195.16.151]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73FiXFl028554 for ; Sun, 3 Aug 2003 08:44:33 -0700 Received: by mail.6com.sk (Postfix, from userid 501) id 84173630E; Sun, 3 Aug 2003 17:44:27 +0200 (CEST) Date: Sun, 3 Aug 2003 17:44:27 +0200 From: Jan Oravec To: netdev@oss.sgi.com Subject: problem setting net.ipvX.conf.all.forwarding via sysctl() system call Message-ID: <20030803154427.GA12926@wsx.ksp.sk> Reply-To: Jan Oravec Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-Operating-System: UNIX X-archive-position: 4483 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jan.oravec@6com.sk Precedence: bulk X-list: netdev Hello, When net.ipvX.conf.all.forwarding is enabled via sysctl() system call, forwarding is not enabled on all interfaces as it is when it is changed using /proc filesystem. For IPv6, it is obviously because sysctl 'strategy' handler is not defined. For IPv4, it is because ipv4_sysctl_forward_strategy only copy new value to check whether it has changed and does not update ipv4_devconf.forwarding before calling inet_forward_change(). (it is copied internally by sysctl after ipv4_sysctl_forward_strategy because we return positive number) I am not good in kernel parallel computing strategy, whether it requires some locking or it is safe to do: --- sysctl_net_ipv4.c.old 2003-08-03 17:37:44.000000000 +0200 +++ sysctl_net_ipv4.c 2003-08-03 17:38:18.000000000 +0200 @@ -109,8 +109,9 @@ static int ipv4_sysctl_forward_strategy( } } + ipv4_devconf.forwarding=new; inet_forward_change(); - return 1; + return 0; } ctl_table ipv4_table[] = { Best Regards, -- Jan Oravec XS26 coordinator 6COM s.r.o. 'Access to IPv6' http://www.6com.sk http://www.xs26.net From jgarzik@pobox.com Sun Aug 3 11:00:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:00:36 -0700 (PDT) Received: from www.linux.org.uk (IDENT:sjW7I344dNw6uUHLQE/tIbYI710RnggI@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73I0RFl007871 for ; Sun, 3 Aug 2003 11:00:28 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jMMR-0007pt-LH; Sun, 03 Aug 2003 18:09:35 +0100 Message-ID: <3F2D41B7.7040205@pobox.com> Date: Sun, 03 Aug 2003 13:09:11 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> <3F2C7E12.8070904@pobox.com> <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4484 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Sat, Aug 02, 2003 at 11:14:26PM -0400, Jeff Garzik wrote: > >>Matthew Wilcox wrote: >> >>>Nothing stops it being implemented as a macro in kcompat. Having it as >>>an inline function gives it argument typechecking which always gives me >>>the warm fuzzies. >> >>No, it _needs_ to be a macro for maximum flexibility. >> >>Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a >>trigger, to signal that compat code is needed. No need for drivers to >>create tons of kernel-version-code ifdefs, just to test for when >>ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor >>backports, and (possibly) 2.4 itself. Also, doing it at the cpp level >>allows compat code to #undef it, if it _really_ knows what its doing, >>and the situation calls for it. > > > OK. At this point, I really feel like I'm getting in the way and > hindering more than I'm helping. Can I pass the torch to you and let > you finish the job? Sorry to give that impression :( I think we're pretty much "there". But if you wanna hand it off to me for the last little bits, and merging, that's fine too. I'll leave it up to you. Jeff From werner@almesberger.net Sun Aug 3 11:05:57 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:06:00 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73I5uFl008556 for ; Sun, 3 Aug 2003 11:05:57 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h73I5oG04155; Sun, 3 Aug 2003 11:05:50 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h73I5c110505; Sun, 3 Aug 2003 15:05:38 -0300 Date: Sun, 3 Aug 2003 15:05:37 -0300 From: Werner Almesberger To: David Lang Cc: Alan Shih , Ben Greear , Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030803150537.C10280@almesberger.net> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from david.lang@digitalinsight.com on Sun, Aug 03, 2003 at 01:25:48AM -0700 X-archive-position: 4485 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev David Lang wrote: > do you really want the processor on the card to be tunning > apache/NFS/Samba/etc ? If it runs a Linux kernel, that's not a problem. Whether you actually want to do this or not, becomes an entirely separate issue. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From hadi@cyberus.ca Sun Aug 3 11:15:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:15:19 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73IF9Fl009421 for ; Sun, 3 Aug 2003 11:15:10 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jNNq-000OgI-00; Sun, 03 Aug 2003 14:15:06 -0400 Subject: Re: [RFC] High Performance Packet Classifiction for tc framework From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion and Thomas Heinz Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3F16A0E5.1080007@hipac.org> References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> Content-Type: text/plain Organization: jamalopolis Message-Id: <1059934468.1103.41.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Aug 2003 14:14:28 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4486 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Hi, Apologies for late response. Its funny how i thought i was going to have more time in the last 2 weeks but due to bad scheduling that wasnt the case. On Thu, 2003-07-17 at 09:13, Michael Bellion and Thomas Heinz wrote: > Hi Jamal > > You wrote: > > This is good.I may have emailed you about this topic before? > > Yes, but at that time we had not any concrete plans to > integrate hipac into tc. We focussed on making nf-hipac as > expressive as iptables first. > Good goal. > > It's a classifier therefore it makes sense ;-> > > :-) > > > nice. What would be interesting is to see your rule update rates vs > > iptables (i expect iptables to suck) - but how do you compare aginst any > > of the tc classifiers for example? > > Regarding the rule update rates we have not done any measurements > yet but nf-hipac should be faster than iptables (even more when > we have implemented the selective cloning stuff). On the other > hand we are probably slower than tc because in addition to the > insert operation into an internal chain there is the actual hipac > insert operation. The insertion in the internal chain is quicker > than the tc insert operation because we use doubly linked lists. > I think i will have to look at your code to make comments. > Regarding the matching performance one has to consider a few things. > The currently existing tc classifiers are an abstraction for rules > (iptables "slang") whilst hipac is an abstraction for a set of rules > (including the chain semantics known from iptables), i.e. a table in > the iptables world. Not entirely accurate. Depends which tc classifier. u32 hash tables are infact like iptables chains. Note, the concept of priorities which is used for conflict resolution as well as further separating sets of rules doesnt exist in iptables. > Of course it is possible to have some sort > of extended classifying in tc too, True, i overlooked this. > i.e. you can add several fw or u32 > filters with the same prio which allows the filters to be hashed. You can also have them use different priorities and with the continue operator first clasify based on packet data then on metadata or on another packet header filter. > One disadvantage of this concept is that the hashed filters > must be compact, i.e. there cannot be other classifiers in between. I didnt understand this. Are you talking about conflict resolving of overlapping filters? > Another major disadvantage is caused by the hashing scheme. > If you use the hash for 1 dimension you have to make sure that > either all filters in a certain bucket are disjoint or you must have > an implicit ordering of the rules (according to the insertion order > or something). This scheme is not extendable to 2 or more dimensions, > i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so > on, because you simply cannot express arbitrary rulesets. If i understood you - you are refering to a way to reduce the number of lookups by having disjoint hashes. I suppose for something as simple as a five tuple lookup, this is almost solvable by hardcoding the different fields into multiway hashes. Its when you try to generalize that it becomes an issue. > Another general problem is of course that the user has to manually > setup the hash which is rather inconvenient. > Yes. Take a look at Werners tcng - he has a clever way to hide things from the user. I did experimentation on u32 with a kernel thread which rearranged things when they seemed out of balance but i havent experimented with a lot of rules. > Now, what are the implications on the matching performance: > tc vs. nf-hipac? As long as the extended hashing stuff is not used > nf-hipac is clearly superior to tc. You are refering to u32. You mean as long as u32 stored things in a single linked list, you win - correct? > When hashing is used it _really_ > depends. If there is only one classifier (with hashing) per interface > and the number of rules per bucket is very small the performance should > be comparable. As soon as you add other classifiers nf-hipac will > outperform tc again. > If we take a simple user interface abstraction like tcng which hides the evil of u32 and then take simple 5 tuple rules - i doubt you will see any difference. For more generic setup, the kernel thread i refer to would work - but may slow insertion. > >>The tc framework is very flexible with respect to where filters can be > >>attached. Unfortunately this cannot be mapped into one HIPAC data > >>structure. Our current design allows to attach filters anywhere but > >>only the filters attached to the top level qdisc would benefit from the > >>HIPAC algorithm. Would this be a noticeable restriction? > > > > I dont think so, but can ytou describe this restriction? > > Well, we thought a little more about the design and came to the > conclusion that it is not necessary to have a HIPAC qdisc at root > but it suffices to ensure that the HIPAC classifier occurs only > once per interface. As you can guess from the last sentence we > dropped the HIPAC qdisc design and changed to the following scheme: > > - there no special HIPAC qdisc at all :-) > - the HIPAC classifier is no longer a simple rule but represents > the whole table > - the HIPAC classifier can occur in any qdisc but at most once > per interface > > So, basically HIPAC is just a normal classifier like any other > with two exceptions: > a) it can occur only once per interface > b) the rules within the classifier can contain other classifiers, > e.g. u32, fw, tc_index, as matches > But why restriction a)? Also why should we need hipac to hold other filters when the infrastructure itself can hold the extended filters just fine? I think you may actually be trying to say why somewhere in the email, but it must not be making a significant impression on my brain. > There is just one problem with the current tc framework. Once > a new filter is inserted into the chain it is not removed even > if the change function of the classifier returns < 0 > (2.6.0-test1: net/sched/cls_api.c: line 280f). > This should be changed anyway, shouldn't it? > Are you refering to this piece of code?: ---- err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh); if (err == 0) tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER); errout: if (cl) cops->put(q, cl); return err; --- change() should not return <0 if it has installed the filter i think. Should the top level code be responsible for removing filters? > >>- new HIPAC classifier which supports all native nf-hipac matches > >> (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type, > >> tcpflags, fragments) and additionally fwmark > > > > I would think for cleanliness fwmark or any metadata related > > classification would be separate from one that is based on packet bits. > > Since our classifier represents a table of rules and the rules are > based on different matches, like src/dst ip and also fwmark (native) > or u32 (subclassifier as match), this is definitely a clean design. > I think we need to have the infrastructure in the main tc code. Its already there - may not be very clean right now. > >>- the HIPAC classifier can only be attached to the HIPAC qdisc and vice > >> versa the HIPAC qdisc only accepts HIPAC classifiers > > > > > > We do have an issue with being able to do extended classification > > but building a qdisc for it is a no no. Building a qdisc that will force > > other classifier to structure themselves after it is even a bigger sin. > > Look at the action code i have (i can send you an updated patch); a > > better idea is to make extended classifiers an action based on another > > filter match. At least this is what i have been toying with and i dont > > think it is clean enough. what we need is to extend the filtering > > framework itself to have extended classifiers. > > The new design should be much cleaner. Originally we also thought about > making HIPAC a classifier only but we expected some problems related > to this approach. Finally we discovered that this is not the case :) > Consider what i said above. I'll try n cobble together some examples to demonstrate (although it seems you already know this). To allow for anyone to install classifiers-du-jour without being dependet on hipac would be very useful. So ideas that you have for enabling this cleanly should be moved to cls_api. cheers, jamal From andersen@codepoet.org Sun Aug 3 11:28:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:28:14 -0700 (PDT) Received: from winder.codepoet.org (postfix@codepoet.org [166.70.99.138]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73IS2Fl010527 for ; Sun, 3 Aug 2003 11:28:03 -0700 Received: by winder.codepoet.org (Codepoet.org Mail Daemon, from userid 1000) id 2E763157577; Sun, 3 Aug 2003 12:27:56 -0600 (MDT) Date: Sun, 3 Aug 2003 12:27:55 -0600 From: Erik Andersen To: Werner Almesberger Cc: Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803182755.GA16770@codepoet.org> Reply-To: andersen@codepoet.org Mail-Followup-To: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <20030803145737.B10280@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030803145737.B10280@almesberger.net> User-Agent: Mutt/1.3.28i X-Operating-System: Linux 2.4.19-rmk7, Rebel-NetWinder(Intel StrongARM 110 rev 3), 185.95 BogoMips X-No-Junk-Mail: I do not want to get *any* junk mail. X-archive-position: 4487 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: andersen@codepoet.org Precedence: bulk X-list: netdev On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote: > > There is one interesting TOE solution, that I have yet to see created: > > run Linux on an embedded processor, on the NIC. > > That's basically what I've been talking about all the > while :-) http://www.snapgear.com/pci630.html -Erik -- Erik B. Andersen http://codepoet-consulting.com/ --This message was written using 73% post-consumer electrons-- From akpm@osdl.org Sun Aug 3 12:01:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:01:43 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73J1bFl013068 for ; Sun, 3 Aug 2003 12:01:37 -0700 Received: from mnm (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h73J17I30784; Sun, 3 Aug 2003 12:01:07 -0700 Date: Sun, 3 Aug 2003 12:02:23 -0700 From: Andrew Morton To: Stephen Rothwell Cc: netdev@oss.sgi.com, janfrode@parallab.no Subject: Fw: [Bugme-new] [Bug 1036] New: Badness in local_bh_enable at kernel/softirq.c:113 Message-Id: <20030803120223.738a7453.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4488 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev (The "badness" warning is a tty locking problem. It does not explain the pptp client failures) Begin forwarded message: Date: Sun, 3 Aug 2003 04:53:31 -0700 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 1036] New: Badness in local_bh_enable at kernel/softirq.c:113 http://bugme.osdl.org/show_bug.cgi?id=1036 Summary: Badness in local_bh_enable at kernel/softirq.c:113 Kernel Version: 2.6.0-test2 Status: NEW Severity: high Owner: bugme-janitors@lists.osdl.org Submitter: janfrode@parallab.no Distribution: gentoo Hardware Environment: AMD AthlonXP Software Environment: ppp-2.4.1-r14 pptpclient-1.2.0 Problem Description: My pptp client connections keeps dying, syslogging: Aug 3 13:35:36 [pppd] Using interface ppp0 Aug 3 13:35:36 [pppd] Connect: ppp0 <--> /dev/pts/4 Aug 3 13:35:36 [/etc/hotplug/net.agent] NET add event not supported Aug 3 13:35:38 [pptp] anon log[decaps_hdlc:pptp_gre.c:198]: PPP mode seems to be Asynchronous._ Aug 3 13:35:39 [pppd] Remote message: Welcome^M^J Aug 3 13:35:41 [pppd] local IP address 129.177.43.23 Aug 3 13:35:41 [pppd] remote IP address 129.177.43.1 Aug 3 13:36:07 [pppd] Unsupported protocol 0xd44a received Aug 3 13:36:57 [pppd] Unsupported protocol 0xcc4a received aug 3 13:38:20 [su(pam_unix)] session opened for user root by (uid=1001) Aug 3 13:39:21 [anacron] Job `cron.daily' started Aug 3 13:39:29 [crontab] (root) LIST (root)_ Aug 3 13:39:37 [pptp] anon warn[decaps_gre:pptp_gre.c:300]: short read (-1): Message too long Aug 3 13:39:37 [pptp] anon log[callmgr_main:pptp_callmgr.c:234]: Closing connection Aug 3 13:39:37 [pptp] anon log[pptp_conn_close:pptp_ctrl.c:308]: Closing PPTP connection Aug 3 13:39:39 [pptp] anon log[call_callback:pptp_callmgr.c:74]: Closing connection Aug 3 13:39:39 [pppd] Hangup (SIGHUP) Aug 3 13:39:39 [kernel] Badness in local_bh_enable at kernel/softirq.c:113 Aug 3 13:39:39 [pppd] Modem hangup Aug 3 13:39:39 [pppd] Connection terminated. Aug 3 13:39:39 [pppd] Connect time 4.1 minutes. Aug 3 13:39:39 [pppd] Sent 310556 bytes, received 1615363 bytes. Aug 3 13:39:39 [/etc/hotplug/net.agent] NET remove event not supported Aug 3 13:39:39 [pppd] Failed to open /dev/pts/4: No such file or directory - Last output repeated 9 times - Aug 3 13:39:39 [pppd] Exit. And giving this call trace in the kernel log: Badness in local_bh_enable at kernel/softirq.c:113 Call Trace: [] local_bh_enable+0x88/0x90 [] ppp_async_push+0xa4/0x1b0 [] __lookup_hash+0x64/0xd0 [] ppp_asynctty_wakeup+0x31/0x60 [] pty_unthrottle+0x56/0x60 [] check_unthrottle+0x3a/0x40 [] n_tty_flush_buffer+0x14/0x50 [] pty_flush_buffer+0x5e/0x60 [] do_tty_hangup+0x3ac/0x420 [] release_dev+0x5b3/0x600 [] snd_pcm_oss_init_substream+0x50/0x90 [] zap_pmd_range+0x4e/0x70 [] unmap_page_range+0x4e/0x90 [] tty_release+0x2b/0x60 [] __fput+0xce/0xe0 [] filp_close+0x4b/0x80 [] put_files_struct+0x6c/0xe0 [] do_exit+0x165/0x340 [] sys_exit+0x15/0x20 [] syscall_call+0x7/0xb Steps to reproduce: Don't know how to trigger it, but it happens all the time. ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From hadi@cyberus.ca Sun Aug 3 12:06:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:06:51 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73J6jFl013745 for ; Sun, 3 Aug 2003 12:06:46 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jOBo-0002VY-00; Sun, 03 Aug 2003 15:06:44 -0400 Subject: Re: multiple unicast mac address (was Re: netdev_ops retraction) From: jamal Reply-To: hadi@cyberus.ca To: Rick Payne Cc: Jeff Garzik , netdev@oss.sgi.com In-Reply-To: <2147483647.1059667766@fozzy.rossfell.co.uk> References: <20030730184416.GI22222@parcelfarce.linux.theplanet.co.uk> <2147483647.1059659359@fozzy.rossfell.co.uk> <3F292B38.4070508@pobox.com> <2147483647.1059667766@fozzy.rossfell.co.uk> Content-Type: text/plain Organization: jamalopolis Message-Id: <1059937567.1102.77.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Aug 2003 15:06:07 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4489 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Last discussion that happened: http://marc.theaimsgroup.com/?t=104163060100001&r=1&w=2 cheers, jamal On Thu, 2003-07-31 at 11:09, Rick Payne wrote: > --On Thursday, July 31, 2003 10:44 am -0400 Jeff Garzik > wrote: > > > Hardware that filters N MAC addresses (unicast filtering) doesn't have a > > terribly standard interface, and the unicast filter must be adjusted at > > Indeed but where its possible to support it, it can be - and those cards > will be specified by those who need it (for HA, VRRP etc). > > > different times on different hardware. Also, chip bugs lead one to think > > unicast filtering will work where it doesn't. Also, chip limits for some > > of the more popular chips are unknown. Also, the need for this feature > > is very uncommon, and can be achieved in other ways. > > As I said - promiscuous mode and filtering on the receive side - which is > what you have to resort to anyway for those cards that don't support it. > > Alternatively, its just another patch people need to add to make things do > what they want - just like the ARP patch. > > Rick > > From ebiederm@xmission.com Sun Aug 3 12:24:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:24:51 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73JOcFl015227 for ; Sun, 3 Aug 2003 12:24:39 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id NAA26235; Sun, 3 Aug 2003 13:21:09 -0600 To: Werner Almesberger Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> From: ebiederm@xmission.com (Eric W. Biederman) Date: 03 Aug 2003 13:21:09 -0600 In-Reply-To: <20030802184901.G5798@almesberger.net> Message-ID: Lines: 59 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4490 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Werner Almesberger writes: > Jeff Garzik wrote: > > jabbering at the same time. TCP is a "one size fits all" solution, but > > it doesn't work well for everyone. > > But then, ten "optimized xxPs" that work well in two different > scenarios each, but not so good in the 98 others, wouldn't be > much fun either. The optimized for low latency cases seem to have a strong market in clusters. And they are currently keeping alive quite a few technologies. Myrinet, Infiniband, Quadric's Elan, etc. Having low latency and switch technologies that scale is quite rare currently. > Another problem of TCP is that it has grown a bit too many > knobs you need to turn before it works over your really fast > really long pipe. (In one of the OLS after dinner speeches, > this was quite appropriately called the "wizard gap".) Does anyone know which knobs to turn to make TCP go fast over Infiniband. (A low latency high bandwidth network?) I get to deal with them on a regular basis... There is one place in low latency communications that I can think of where TCP/IP is not the proper solution. For low latency communication the checksum is at the wrong end of the packet. IB gets this one correct and places the checksum at the tail end of the packet. This allows the packet to start transmitting before the checksum is computed, possibly even having the receive start at the other end before the tail of the packet is transmitted. Would it make any sense to do a low latency variation on TCP that fixes that problem? For the IP header we are fine as the data precedes the checksum. But the problem appears to affect all of the upper level protocols that ride on IP, UDP, TCP, SCTP... > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > memory bus instead of offloading portions of TCP that do not need to be > > offloaded. > > That's a good point. A hierarchical memory structure can help > here. Moving one end closer to the hardware, and letting it > know (e.g. through sendfile) that also the other end is close > (or can be reached more directly that through some hopelessly > crowded main bus) may help too. On that score it is worth noting that the next generation of peripheral busses (Hypertransport, PCI Express, etc) are all switched. Which means that device to device communication may be more reasonable. Going from a bussed interconnect to a switched interconnect is certainly a dramatic change in infrastructure. How that will affect the tradeoffs I don't know. Eric From lm@bitmover.com Sun Aug 3 12:40:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:40:44 -0700 (PDT) Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73JeWFl018109 for ; Sun, 3 Aug 2003 12:40:33 -0700 Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h743iem7002500; Sun, 3 Aug 2003 20:44:40 -0700 Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73JeB108493; Sun, 3 Aug 2003 12:40:11 -0700 Date: Sun, 3 Aug 2003 12:40:11 -0700 From: Larry McVoy To: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803194011.GA8324@work.bitmover.com> Mail-Followup-To: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <20030803145737.B10280@almesberger.net> <20030803182755.GA16770@codepoet.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030803182755.GA16770@codepoet.org> User-Agent: Mutt/1.4i X-MailScanner-Information: Please contact the ISP for more information X-MailScanner: Found to be clean X-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12) X-archive-position: 4491 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lm@bitmover.com Precedence: bulk X-list: netdev On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote: > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote: > > > There is one interesting TOE solution, that I have yet to see created: > > > run Linux on an embedded processor, on the NIC. > > > > That's basically what I've been talking about all the > > while :-) > > http://www.snapgear.com/pci630.html ipcop plus a new PC for $200 is way higher performance and does more. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm From david.lang@digitalinsight.com Sun Aug 3 13:15:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:15:18 -0700 (PDT) Received: from warden.diginsite.com (warden-p.diginsite.com [208.29.163.248]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KF7Fl020689 for ; Sun, 3 Aug 2003 13:15:07 -0700 Received: from wlvims01.diginsite.com by warden.diginsite.com via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 13:15:07 -0700 Received: from calexc01.digitalinsight.com ([10.200.0.20]) by wlvims01.digitalinsight.com (Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35) with ESMTP id com; Sun, 3 Aug 2003 13:14:17 -0700 Received: by calexc01.diginsite.com with Internet Mail Service (5.5.2653.19) id ; Sun, 3 Aug 2003 13:15:01 -0700 Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) id QF5KMNL2; Sun, 3 Aug 2003 13:14:59 -0700 From: David Lang To: Larry McVoy Cc: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Date: Sun, 3 Aug 2003 13:13:24 -0700 (PDT) Subject: Re: TOE brain dump In-Reply-To: <20030803194011.GA8324@work.bitmover.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4492 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david.lang@digitalinsight.com Precedence: bulk X-list: netdev On Sun, 3 Aug 2003, Larry McVoy wrote: > On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote: > > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote: > > > > There is one interesting TOE solution, that I have yet to see created: > > > > run Linux on an embedded processor, on the NIC. > > > > > > That's basically what I've been talking about all the > > > while :-) > > > > http://www.snapgear.com/pci630.html > > ipcop plus a new PC for $200 is way higher performance and does more. however I can see situations where you would put multiple cards in one box and there could be an advantage to useing PCI (or PCI-X) for you communications between the different 'nodes' of you routing cluster instead of gig ethernet. if this is the approach that the networking guys really want to encourage how about defining an API that you would be willing to support and you can even implement it and then any card that is produced would be supported from day 1. this interface would not have to cover the configuration of the card (that can be done with userspace tools that talk to the card over the 'network', it just needs to cover the ability to do what is effectivly IP over PCI. Linus has commented that in mahy ways Linux is not designed for any existing CPU, it's designed for a virtual CPU that implements all the features we want and those features that aren't implemented in the chips get emulated as needed (obviously what is actually implemented and the speed of emulation are serious considerations for performance) why doesn't the network team define what they thing the ideal NIC interface would be. I can see three catagories of 'ideal' cards 1. cards that are directly driven by the kernel IP stack (similar to what we support now, but an ideal version) 2. router nodes that have access to main memory (PCI card running linux acting as a router/firewall/VPN to offload the main CPU's) 3. router nodes that don't have access to main memory (things like USB/fibrechannel/infiniband/etc versions of #2, the node can run linux and deal with the outside world, only sending the data that is needed to/from the host) even if nobody makes hardware that supports all the desired features directly having a 'this is the dieal driver' reference should impruve furture drivers by letting them use this as the core and implementing code to simulate the features not in hardware. they claim they need this sort of performance, you say 'not that way do it sanely' why not give them a sane way to do it? David Lang From lm@bitmover.com Sun Aug 3 13:31:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:31:08 -0700 (PDT) Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KV1Fl021999 for ; Sun, 3 Aug 2003 13:31:02 -0700 Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h744ZKm7003004; Sun, 3 Aug 2003 21:35:20 -0700 Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73KUp509118; Sun, 3 Aug 2003 13:30:51 -0700 Date: Sun, 3 Aug 2003 13:30:51 -0700 From: Larry McVoy To: David Lang Cc: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803203051.GA9057@work.bitmover.com> Mail-Followup-To: Larry McVoy , David Lang , Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi References: <20030803194011.GA8324@work.bitmover.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4i X-MailScanner-Information: Please contact the ISP for more information X-MailScanner: Found to be clean X-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12) X-archive-position: 4493 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lm@bitmover.com Precedence: bulk X-list: netdev On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote: > 2. router nodes that have access to main memory (PCI card running linux > acting as a router/firewall/VPN to offload the main CPU's) I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, cdrom, floppy, onboard enet extra net card for routing, for $250 or less, quantity 1, shipped to my door. Why would I want to spend money on some silly offload card when I can get the whole PC for less than the card? -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm From alan@lxorguk.ukuu.org.uk Sun Aug 3 13:55:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:55:31 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KtOFl023931 for ; Sun, 3 Aug 2003 13:55:25 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h73KpOC3031925; Sun, 3 Aug 2003 21:51:25 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h73KpMDK031923; Sun, 3 Aug 2003 21:51:22 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: TOE brain dump From: Alan Cox To: Werner Almesberger Cc: netdev@oss.sgi.com, Linux Kernel Mailing List In-Reply-To: <20030802191411.H5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk> <20030802191411.H5798@almesberger.net> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1059943881.31900.1.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 03 Aug 2003 21:51:21 +0100 X-archive-position: 4494 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Sad, 2003-08-02 at 23:14, Werner Almesberger wrote: > That's taking this idea to an extreme, yes. I'd think of > using something as big as an amd64 for this as "too > expensive", but perhaps it's cheap enough in the long run, > compared to some "optimized" design. Volume makes cheap. If you look at software v hardware raid controllers the hardware people are permanently being killed by the low volume of cards. From alan@lxorguk.ukuu.org.uk Sun Aug 3 13:56:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:56:21 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KuGFl024198 for ; Sun, 3 Aug 2003 13:56:16 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h73KqIC3031940; Sun, 3 Aug 2003 21:52:19 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h73KqEOh031938; Sun, 3 Aug 2003 21:52:14 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: TOE brain dump From: Alan Cox To: Ben Greear Cc: Jeff Garzik , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, Linux Kernel Mailing List In-Reply-To: <3F2C891B.7080004@candelatech.com> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <3F2C891B.7080004@candelatech.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1059943933.31901.3.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 03 Aug 2003 21:52:13 +0100 X-archive-position: 4495 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Sul, 2003-08-03 at 05:01, Ben Greear wrote: > Jeff Garzik wrote: > > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. > but maybe 8x Serial ATA and RAID :) There is a protocol floating around for ATA over ethernet, no TCP layer or nasty latency eating complexities in the middle From hadi@cyberus.ca Sun Aug 3 14:17:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 14:17:18 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73LH5Fl025960 for ; Sun, 3 Aug 2003 14:17:06 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jPZk-0008ov-00; Sun, 03 Aug 2003 16:35:33 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: Larry McVoy Cc: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi In-Reply-To: <20030803194011.GA8324@work.bitmover.com> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <20030803145737.B10280@almesberger.net> <20030803182755.GA16770@codepoet.org> <20030803194011.GA8324@work.bitmover.com> Content-Type: text/plain Organization: jamalopolis Message-Id: <1059942894.1103.96.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Aug 2003 16:34:54 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4496 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sun, 2003-08-03 at 15:40, Larry McVoy wrote: > On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote: > > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote: > > > > There is one interesting TOE solution, that I have yet to see created: > > > > run Linux on an embedded processor, on the NIC. > > > > > > That's basically what I've been talking about all the > > > while :-) > > > > http://www.snapgear.com/pci630.html > > ipcop plus a new PC for $200 is way higher performance and does more. ;-> Actually this proves that putting the whole stack on the NIC is the wrong way to go ;-> That poor piece of NIC was obsoleted before it was born on pricing alone and not just compute power it was supposed to liberate us from. I think the idea of hierachical memories and computation is certainly interesting. Put a CPU and memory on the NIC but not to do the work that Linux already does. Instead think of the NIC and its memeory + CPU as a L1 data and code cache for TCP processing. The idea posed from Davem is intriguing: The only thing the NIC should do is TCP fast path processing based on cached control data generated from the main CPU stack. Any time the fast path gets violated, the cache gets invalidate and it becomes an exception handling to be handled by the main CPU stack. IMO, the only time this will make sense is when the setup cost (downloading the cache or cookies as Dave calls them) is amortized by the data that follows. For example, may not make sense to worry about a HTTP1.0 flow which has 3-4 packets after the SYNack.Bulk transfers make sense (storage, file serving). I dont remember the Mogul paper details but i think this is what he was implying. cheers, jamal From david.lang@digitalinsight.com Sun Aug 3 14:23:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 14:23:03 -0700 (PDT) Received: from warden3.diginsite.com (warden3-p.diginsite.com [208.147.64.186]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73LMxFl026639 for ; Sun, 3 Aug 2003 14:23:00 -0700 Received: from no.name.available by warden3.diginsite.com via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 14:16:04 -0700 Received: from ata-navgw-how1.anytimeaccess.com ([10.210.80.95]) by ata-mail.anytimeaccess.com (Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35) with SMTP id com for ; Sun, 3 Aug 2003 14:19:17 -0700 Received: from sacexc01.digitalinsight.com ([10.210.80.155]) by ata-navgw-how1.anytimeaccess.com (NAVIEG 2.1 bld 63) with SMTP id M2003080314134107659 ; Sun, 03 Aug 2003 14:13:41 -0700 Received: by sacexc01.anytimeaccess.com with Internet Mail Service (5.5.2656.59) id ; Sun, 3 Aug 2003 14:22:50 -0700 Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) id QF5KMN9H; Sun, 3 Aug 2003 14:22:47 -0700 From: David Lang To: Larry McVoy Cc: Erik Andersen , Werner Almesberger, Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Date: Sun, 3 Aug 2003 14:21:12 -0700 (PDT) Subject: Re: TOE brain dump In-Reply-To: <20030803203051.GA9057@work.bitmover.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4497 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david.lang@digitalinsight.com Precedence: bulk X-list: netdev On Sun, 3 Aug 2003, Larry McVoy wrote: > On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote: > > 2. router nodes that have access to main memory (PCI card running linux > > acting as a router/firewall/VPN to offload the main CPU's) > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? you may want to do this for a database box where you want to dedicate your main processing power to the database task, if you use a seperate box you still have to talk to that box over a network, if you have it as a card you can talk to the card much more efficantly then you can talk to the seperate machine. if your 8-way opteron database box is already the bottleneck for your system you will have to spend a LOT of money to get anything that gives you more available processing power, getting a card to offload any processing from the main CPU's can be a win. yes this is somewhat of a niche market, but as you point out adding more and more processors in a SMP model is not the ideal way to go, either from performance or from the cost point of view. on the webserver front there are a lot of companies making a living by selling cards and boxes to offload processing from the main CPU's of the webservers (cards to do gzip compression are a relativly new addition, but cards to do SSL handshakes have been around for a while) used properly these can be a very worthwhile invenstment for high-volume webserver companies. also the cost of an extra box can be considerably higer then just the cost of the hardware. I know of one situation where between Linux OS license fees (redhat advanced server) and security software (intrusion detection, auditing, privilage management, etc) a company is looking at ~$4000 in licensing fees for every box they put in their datacenter (and this is for boxes just running apache, add something like an oracle or J2EE appserver software and the cost goes up even more). at this point the fact that the box only cost $200 doesn't really matter, spending an extra $500 each on 4 boxes to eliminate the need for a 5th is easily worth it. (and this company is re-examining hardwaare raid controllers after having run software raid for years becouse they are realizing that this is requiring them to run more servers due to the load on the CPU's) at the low end you are right, just add another box or add another CPU to an existing box, but there are conditions that make adding specialized cards to offload specific functionality a win (for that matter, even at the low end people routinly offload graphics processing to specialized cards, simply to make their games run faster) David Lang From alan@storlinksemi.com Sun Aug 3 15:02:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 15:02:33 -0700 (PDT) Received: from smtp011.mail.yahoo.com (smtp011.mail.yahoo.com [216.136.173.31]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73M2OFl029358 for ; Sun, 3 Aug 2003 15:02:24 -0700 Received: from cpe-66-1-155-95.ca.sprintbbd.net (HELO AlanLap) (alansuntzishih@66.1.155.95 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 3 Aug 2003 22:02:23 -0000 From: "Alan Shih" To: "David Lang" Cc: "Ben Greear" , "Jeff Garzik" , "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Sun, 3 Aug 2003 15:02:09 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-Reply-To: Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 X-archive-position: 4498 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev On an embedded system, no processor will be fast enough to compete with direct DMA xfer. So just provide sendfile hooks that allow the kernel to initiate data filling from source to dest then allow TSO to take place. Kernel still needs to take care of the TCP stack. I don't see this as building extensive customization though. Alan -----Original Message----- From: David Lang [mailto:david.lang@digitalinsight.com] Sent: Sunday, August 03, 2003 1:26 AM To: Alan Shih Cc: Ben Greear; Jeff Garzik; Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: RE: TOE brain dump do you really want the processor on the card to be tunning apache/NFS/Samba/etc ? putting enough linux on the card to act as a router (which would include the netfilter stuff) is one thing. putting the userspace code that interfaces with the outside world for file transfers is something else. if you really want the disk connected to your network card you are just talking a low-end linux box. forget all this stuff about it being on a card and just use a full box (economys of scale will make this cheaper) making a firewall that's a core system with a dozen slave systems attached to it (the network cards) sounds like the type of clustering that Linux has been used for for compute nodes. complicated to setup, but extremely powerful and scalable once configured. if you want more then a router on the card then Alan Cox is right, just add another processor to the system, it's easier and cheaper. David Lang On Sat, 2 Aug 2003, Alan Shih wrote: > Date: Sat, 2 Aug 2003 23:22:52 -0700 > From: Alan Shih > To: Ben Greear , Jeff Garzik > Cc: Nivedita Singhvi , > Werner Almesberger , netdev@oss.sgi.com, > linux-kernel@vger.kernel.org > Subject: RE: TOE brain dump > > A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... > need a lot of bufferring/FIFO though. May require large modification to the > file serving applications? > > Alan > > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org > [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear > Sent: Saturday, August 02, 2003 9:02 PM > To: Jeff Garzik > Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; > linux-kernel@vger.kernel.org > Subject: Re: TOE brain dump > > > Jeff Garzik wrote: > > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. > but maybe 8x Serial ATA and RAID :) > > Ben > > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From lm@bitmover.com Sun Aug 3 16:44:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 16:44:46 -0700 (PDT) Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73NiYFl004045 for ; Sun, 3 Aug 2003 16:44:35 -0700 Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h747mnm7005317; Mon, 4 Aug 2003 00:48:49 -0700 Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73NiJM13637; Sun, 3 Aug 2003 16:44:19 -0700 Date: Sun, 3 Aug 2003 16:44:19 -0700 From: Larry McVoy To: David Lang Cc: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803234419.GA13604@work.bitmover.com> Mail-Followup-To: Larry McVoy , David Lang , Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi References: <20030803203051.GA9057@work.bitmover.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4i X-MailScanner-Information: Please contact the ISP for more information X-MailScanner: Found to be clean X-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12) X-archive-position: 4499 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lm@bitmover.com Precedence: bulk X-list: netdev On Sun, Aug 03, 2003 at 02:21:12PM -0700, David Lang wrote: > if your 8-way opteron database box is already the bottleneck for your > system you will have to spend a LOT of money to get anything that gives > you more available processing power, getting a card to offload any > processing from the main CPU's can be a win. I'd like to see data which supports this. CPUs have gotten so fast and disk I/O still sucks. All the systems I've seen are CPU rich and I/O starved. The smartest thing you could do would be to get a cheap box with a GB of ram as a disk cache and make it be a SAN device. Make N of those and you have tons of disk space and tons of cache and your 8 way opteron database box is going to work just fine. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm From david-b@pacbell.net Sun Aug 3 20:06:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:06:37 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7435xFl018707 for ; Sun, 3 Aug 2003 20:06:00 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h7435gjA011136; Sun, 3 Aug 2003 22:05:43 -0500 (CDT) Message-ID: <3F2DCE56.6030601@pacbell.net> Date: Sun, 03 Aug 2003 20:09:10 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: Ben Greear , jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> In-Reply-To: <20030803003239.4257ef24.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4500 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: >>Although I have not tried this latest patch, the existing e100 and e1000 in >>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), >>even when the next hard_start_xmit() call fails. > > > Returning an error from hard_start_xmit() from normal ethernet > drivers is, frankly, a driver bug and should never happen. What's "normal" mean? With the current USB stack, network adapters tend to need memory allocations there. Those can fail, though it seems that's not very common. Doesn't seem like a bug, for all that I'd rather see the those paths be zero-alloc in 2.7. - Dave From davem@redhat.com Sun Aug 3 20:13:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:13:19 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743DDFl019496 for ; Sun, 3 Aug 2003 20:13:14 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA18891; Sun, 3 Aug 2003 20:08:51 -0700 Date: Sun, 3 Aug 2003 20:08:51 -0700 From: "David S. Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803200851.7d46a605.davem@redhat.com> In-Reply-To: <3F2DCE56.6030601@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4501 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 20:09:10 -0700 David Brownell wrote: > David S. Miller wrote: > >>Although I have not tried this latest patch, the existing e100 and e1000 in > >>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), > >>even when the next hard_start_xmit() call fails. > > > > > > Returning an error from hard_start_xmit() from normal ethernet > > drivers is, frankly, a driver bug and should never happen. > > What's "normal" mean? One that can manage it's own TX resources. > With the current USB stack, network adapters tend to need > memory allocations there. Those can fail, though it seems > that's not very common. Doesn't seem like a bug, for all > that I'd rather see the those paths be zero-alloc in 2.7. Any particular reason why the SKB data itself can't be mapped directly? We created all of these DMA mapping abstractions remember? :-) Another option is to pre-allocate, such that while the TX queue is awake we know we have enough resources to send any given packet. Then in ->hard_start_xmit() after using a buffer we allocate a replacement buffer, if this fails in such a way that a subsequent ->hard_start_xmit() could possibly fail, we do netif_stop_queue(). Look to tg3 to see what I'm talking about in general. netif_stop_queue() is done at the moment at which it may be possible that we cannot accept the queueing of a TX packet. This means that when TX entries available <= MAX_SKB_FRAGS + 1, we stop the queue. This guarentees that we will always be able to handle any packet given to us via ->hard_start_xmit(). From david-b@pacbell.net Sun Aug 3 20:41:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:41:43 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743fcFl021918 for ; Sun, 3 Aug 2003 20:41:38 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h743fXjA026551; Sun, 3 Aug 2003 22:41:33 -0500 (CDT) Message-ID: <3F2DD6BD.7070504@pacbell.net> Date: Sun, 03 Aug 2003 20:45:01 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> In-Reply-To: <20030803200851.7d46a605.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4502 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev >>>>Although I have not tried this latest patch, the existing e100 and e1000 in >>>>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), >>>>even when the next hard_start_xmit() call fails. >>> >>> >>>Returning an error from hard_start_xmit() from normal ethernet >>>drivers is, frankly, a driver bug and should never happen. >> >>What's "normal" mean? > > > One that can manage it's own TX resources. Which for the moment, would seem to exclude USB. >>With the current USB stack, network adapters tend to need >>memory allocations there. Those can fail, though it seems >>that's not very common. Doesn't seem like a bug, for all >>that I'd rather see the those paths be zero-alloc in 2.7. > > > Any particular reason why the SKB data itself can't be > mapped directly? We created all of these DMA mapping > abstractions remember? :-) Yes, but the network drivers weren't the ones that needed them! Some older drivers do copy the buffer out of (or for rx, into) the skb, but newer ones just pass the skb data, avoiding a copy. In either case, the buffer was always DMA mapped. Nowadays, some drivers will even set NETIF_F_HIGHDMA if they're going out through a host controller that allows it! (Intel boxes only, AFAIK.) > Another option is to pre-allocate, such that while the TX > queue is awake we know we have enough resources to send any > given packet. Then in ->hard_start_xmit() after using a buffer > we allocate a replacement buffer, if this fails in such a way > that a subsequent ->hard_start_xmit() could possibly fail, we > do netif_stop_queue(). Pre-allocation can be done for the URBs that wrap the data buffers, yes. Not often done today; but it could be. What can't be pre-allocated in a reliable way is the resources used by the host controller drivers, specifically the transfer descriptors. EHCI and OHCI usually need one per URB, unless MTU is over 4 KB. UHCI normally needs quite a few. The API that works inside USB "gadgets' does allow pre-allocation at all those levels, mostly because it's factored to make the submission and completion paths fast. So that "stop if can't pre-allocate" scheme would work, given an "ether.c" patch! :) - Dave From davem@redhat.com Sun Aug 3 20:51:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:51:34 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743pUFl022879 for ; Sun, 3 Aug 2003 20:51:30 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA19000; Sun, 3 Aug 2003 20:46:42 -0700 Date: Sun, 3 Aug 2003 20:46:42 -0700 From: "David S. Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803204642.684c6075.davem@redhat.com> In-Reply-To: <3F2DD6BD.7070504@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4503 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 20:45:01 -0700 David Brownell wrote: > What can't be pre-allocated in a reliable way is the resources > used by the host controller drivers, specifically the transfer > descriptors. EHCI and OHCI usually need one per URB, unless > MTU is over 4 KB. UHCI normally needs quite a few. Ok, that's interesting. Is there a callback that tells the USB driver that some host controller "resources" have become available? I mean, these host controllers either have to queue requests when out of resources or provide a callback so that the drivers can resubmit. Right? From david-b@pacbell.net Sun Aug 3 21:05:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 21:05:15 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74458Fl024201 for ; Sun, 3 Aug 2003 21:05:09 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h7444wjA026527; Sun, 3 Aug 2003 23:05:04 -0500 (CDT) Message-ID: <3F2DDC3A.2040707@pacbell.net> Date: Sun, 03 Aug 2003 21:08:26 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> In-Reply-To: <20030803204642.684c6075.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4504 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: > On Sun, 03 Aug 2003 20:45:01 -0700 > David Brownell wrote: > > >>What can't be pre-allocated in a reliable way is the resources >>used by the host controller drivers, specifically the transfer >>descriptors. EHCI and OHCI usually need one per URB, unless >>MTU is over 4 KB. UHCI normally needs quite a few. > > > Ok, that's interesting. All TDs get allocated in usb_submit_urb(), which is the first time the "real" core of USB connects an urb with an I/O queue. That's host-side, not device-side. > Is there a callback that tells the USB driver that some host > controller "resources" have become available? I mean, these host > controllers either have to queue requests when out of resources or > provide a callback so that the drivers can resubmit. No such callback. If no resources, they fail -ENOMEM and the caller must recover. Which is why hard_start_xmit() needs to do something. - Dave From davem@redhat.com Sun Aug 3 21:17:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 21:18:01 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h744HqFl025423 for ; Sun, 3 Aug 2003 21:17:52 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id VAA19041; Sun, 3 Aug 2003 21:13:34 -0700 Date: Sun, 3 Aug 2003 21:13:33 -0700 From: "David S. Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803211333.12839f66.davem@redhat.com> In-Reply-To: <3F2DDC3A.2040707@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> <3F2DDC3A.2040707@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4505 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 21:08:26 -0700 David Brownell wrote: > No such callback. If no resources, they fail -ENOMEM and the > caller must recover. Which is why hard_start_xmit() needs to > do something. I would suggest something different :-) For example, what do USB block device drivers do when -ENOMEM comes back? Do they just drop the request on the floor? No, rather they resubmit the request later without the scsi/block layer knowing anything about what happened, right? How do the USB block device drivers know when "later" is? This is why I can't believe there is not some kind of "some USB resources have been freed" event of some sort which USB drivers can use to deal with this. :-) From davem@redhat.com Sun Aug 3 22:26:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 22:26:09 -0700 (PDT) Received: from rth.ninka.net (rth.ninka.net [216.101.162.244]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h745Q1Fl030678 for ; Sun, 3 Aug 2003 22:26:02 -0700 Received: from rth.ninka.net (localhost.localdomain [127.0.0.1]) by rth.ninka.net (8.12.8/8.12.8) with SMTP id h745PsSG027235; Sun, 3 Aug 2003 22:25:55 -0700 Date: Sun, 3 Aug 2003 22:25:54 -0700 From: "David S. Miller" To: Glen Turner Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: TOE brain dump Message-Id: <20030803222554.7027a160.davem@redhat.com> In-Reply-To: <3F2DBB2B.9050803@aarnet.edu.au> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <3F2DBB2B.9050803@aarnet.edu.au> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4506 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev [ This discussion belongs on netdev, not linux-kernel. ] On Mon, 04 Aug 2003 11:17:23 +0930 Glen Turner wrote: > That's Matt Mathis's phrase. The Web100 project > has a set of patches to the kernel > which go a long way to reducing the wizard gap. It would be > nice to see those patches eventually appear in the Linux > mainstream. The web100 patches aren't in the kernel because 1) they've never even been submitted and 2) they need a large cleanup. I sort of get the impression that the web100 folks actually like that their changes are not in the main sources, it keeps their work "special". From werner@almesberger.net Sun Aug 3 22:51:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 22:51:44 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h745pWFl000336 for ; Sun, 3 Aug 2003 22:51:32 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h73HvkG04076; Sun, 3 Aug 2003 10:57:50 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h73Hvc310464; Sun, 3 Aug 2003 14:57:38 -0300 Date: Sun, 3 Aug 2003 14:57:37 -0300 From: Werner Almesberger To: Jeff Garzik Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803145737.B10280@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2CAE61.7070401@pobox.com>; from jgarzik@pobox.com on Sun, Aug 03, 2003 at 02:40:33AM -0400 X-archive-position: 4507 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Jeff Garzik wrote: > Really fast, really long pipes in practice don't exist for 99.9% of all > Internet users. It matters to some right now, i.e. the ones who are interested in TOE in the first place. (And there also those who try to tweak TCP to actually work over such links. Right now, its congestion control doesn't scale that well.) Also, IT has been good at making all that elitarian high-performance gear available to the common people rather quickly, and I don't see that changing. The Crisis just alters the pace a little. > When you approach traffic levels that push you want to offload most of > the TCP net stack, then TCP isn't the right solution for you anymore, > all things considered. No. Ironically, TCP is almost always the right solution. Sometimes people try to use something else. Eventually, their protocol wants to go over WANs or something that looks suspiciously like a WAN (MAN or such). At that point, they usually realize that TCP provides exactly the functionality they need. In theory, one could implement the same functionality in other protocols. There was even talk at IETF to support a generic congestion control manager for this purpose. That was many years ago, and I haven't seen anything come out of this. So it seems that, by the time your protocol grows up to want to play in the real world, it wants to be so much like TCP that you're better off using TCP. The amusing bit here is to watch all the "competitors" pop up, grow, fail, and eventually die. > The Linux net stack just isn't built to be offloaded. Yes ! And that's not a flaw of the stack, but it's simply a fact of life. I think that no "real life" stack can be offloaded (in the traditional sense). > And I can't see ASIC and firmware > designers being excited about implementing netfilter on a PCI card :) And when they're done with netfilter, you can throw IPsec, IPv6, or traffic control at them. Eventually, you'll wear them down ;-) > Unfortunately some vendors seem to choosing TOE option #3: TCP offload > which introduces many limitations (connection limits, netfilter not > supported, etc.) which Linux never had before. That's when that little word "no" comes into play, i.e. when their modifications to the stack show up on netdev or linux-kernel. Dave Miller seems to be pretty good at saying "no". I hope he keeps on being good at this ;-) > There is one interesting TOE solution, that I have yet to see created: > run Linux on an embedded processor, on the NIC. That's basically what I've been talking about all the while :-) > The Linux OS driver interface becomes a virtual interface > with a large MTU, Probably not. I think you also want to push some knowledge of where the data ultimately goes to the NIC. This could be something like sendfile, something new, or just a few bytes of user space code. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From jgarzik@pobox.com Sun Aug 3 23:00:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:00:34 -0700 (PDT) Received: from www.linux.org.uk (IDENT:LyFAj4hc+YOjEDA9RBFVQujz5PCEzoP5@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7460RFl001286 for ; Sun, 3 Aug 2003 23:00:28 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jQrk-00026Y-PG; Sun, 03 Aug 2003 22:58:12 +0100 Message-ID: <3F2D8569.1010109@pobox.com> Date: Sun, 03 Aug 2003 17:58:01 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Larry McVoy CC: David Lang , Erik Andersen , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump References: <20030803194011.GA8324@work.bitmover.com> <20030803203051.GA9057@work.bitmover.com> In-Reply-To: <20030803203051.GA9057@work.bitmover.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4508 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Larry McVoy wrote: > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? Yep. I think we are entering the era of what I call RAIC (pronounced "rake") -- redundant array of inexpensive computers. For organizations that can handle the space/power/temperature load, a powerful cluster of supercheap PCs, the "Wal-Mart Supercomputer", can be built for a rock-bottom price. From pekkas@netcore.fi Sun Aug 3 23:06:13 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:06:16 -0700 (PDT) Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7466BFl002029 for ; Sun, 3 Aug 2003 23:06:12 -0700 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id h74664b12177 for ; Mon, 4 Aug 2003 09:06:05 +0300 Date: Mon, 4 Aug 2003 09:06:04 +0300 (EEST) From: Pekka Savola To: netdev@oss.sgi.com Subject: multicast IP datagram forwarding bug and fix (fwd) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4509 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pekkas@netcore.fi Precedence: bulk X-list: netdev I didn't see followups to this, so I'm re-sending to the list just in case it got dropped in the cracks.. -- Pekka Savola "You each name yourselves king, yet the Netcore Oy kingdom bleeds." Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings ---------- Forwarded message ---------- Date: Mon, 28 Jul 2003 13:20:31 -0400 From: "Weng, Wending" To: netdev@oss.sgi.com Subject: multicast IP datagram forwarding bug and fix > Hi, > > LINUX doesn't forward multicast IP datagram if it has option(s), there is is a bug in the module ipmr.c, function > ipmr_forward_finish, below is the current version of this function: > > static inline int ipmr_forward_finish(struct sk_buff *skb) > { > struct dst_entry *dst = skb->dst; > > if (skb->len <= dst->pmtu) > return dst->output(skb); > else > return ip_fragment(skb, dst->output); > } > > it forgets to recalculate the checksum in case the option is modified. > > The following code works properly: > > static inline int ipmr_forward_finish(struct sk_buff *skb) > { > struct dst_entry *dst = skb->dst; > > ip_forward_options (skb); /* this line recalculates checksum if needed. */ > > if (skb->len <= dst->pmtu) > return dst->output(skb); > else > return ip_fragment(skb, dst->output); > } > > Wending Weng From davem@redhat.com Sun Aug 3 23:10:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:10:20 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h746AAFl002620 for ; Sun, 3 Aug 2003 23:10:13 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id XAA19251; Sun, 3 Aug 2003 23:05:52 -0700 Date: Sun, 3 Aug 2003 23:05:52 -0700 From: "David S. Miller" To: Pekka Savola Cc: netdev@oss.sgi.com Subject: Re: multicast IP datagram forwarding bug and fix (fwd) Message-Id: <20030803230552.1aab9411.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4510 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 09:06:04 +0300 (EEST) Pekka Savola wrote: > I didn't see followups to this, so I'm re-sending to the list just in case > it got dropped in the cracks.. I've already checked in a correct fix for this problem from Alexey: # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1584.2.13 -> 1.1584.2.14 # net/ipv4/ipmr.c 1.27 -> 1.28 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/02 kuznet@ms2.inr.ac.ru 1.1584.2.14 # [IPV4]: IP options were not updated while forwarding multicasts. # -------------------------------------------- # diff -Nru a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c --- a/net/ipv4/ipmr.c Sun Aug 3 23:07:44 2003 +++ b/net/ipv4/ipmr.c Sun Aug 3 23:07:44 2003 @@ -1100,6 +1100,7 @@ skb->h.ipiph = skb->nh.iph; skb->nh.iph = iph; + memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt)); #ifdef CONFIG_NETFILTER nf_conntrack_put(skb->nfct); skb->nfct = NULL; @@ -1108,12 +1109,14 @@ static inline int ipmr_forward_finish(struct sk_buff *skb) { - struct dst_entry *dst = skb->dst; + struct ip_options * opt = &(IPCB(skb)->opt); - if (skb->len <= dst_pmtu(dst)) - return dst_output(skb); - else - return ip_fragment(skb, dst_output); + IP_INC_STATS_BH(IpForwDatagrams); + + if (unlikely(opt->optlen)) + ip_forward_options(skb); + + return dst_output(skb); } /* From pekkas@netcore.fi Sun Aug 3 23:11:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:11:44 -0700 (PDT) Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h746BbFl003012 for ; Sun, 3 Aug 2003 23:11:38 -0700 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id h746AsA12243; Mon, 4 Aug 2003 09:11:00 +0300 Date: Mon, 4 Aug 2003 09:10:53 +0300 (EEST) From: Pekka Savola To: Lamont Granquist cc: Bill Davidsen , "David S. Miller" , Carlos Velasco , , , , , , , Subject: Re: [2.4 PATCH] bugfix: ARP respond on all devices In-Reply-To: <20030728213933.F81299@coredump.scriptkiddie.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4511 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pekkas@netcore.fi Precedence: bulk X-list: netdev Hi, Just a thought.. How about consider this change for 2.6 kernel series at this point, and don't backport it 2.4 at least first and/or make the behaviour configurable? Upgrading from 2.4 to 2.6 should be a step big enough that folks should revisit their more advanced configurations, causing smaller surprises. Changing the behaviour inside 2.4.x series might not be reasonable. On Mon, 28 Jul 2003, Lamont Granquist wrote: > On Mon, 28 Jul 2003, Bill Davidsen wrote: > > On Sun, 27 Jul 2003, David S. Miller wrote: > > > This particular case has been discussed to death in the past > > > and I really recommend people read up there before dragging this > > > out further. > > > > It will keep coming back because it's a real problem. I do agree that the > > hidden patch is not the desired way to solve the problem, but until there > > is a reasonable (not requiring a guru or large manual effort) solution > > people will keep bringing it up. > > And it severely violates the principle of least surprise. Its unfortunate > that this principle isn't more widely discussed and considered on lkml. > -- Pekka Savola "You each name yourselves king, yet the Netcore Oy kingdom bleeds." Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings From andi@averellmail.firstfloor.org Mon Aug 4 05:50:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 05:50:45 -0700 (PDT) Received: from zero.aec.at (Bishop.Potter@zero.aec.at [193.170.194.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74CoSFl017874 for ; Mon, 4 Aug 2003 05:50:30 -0700 Received: from fred.muc.de (Jared.Oopf@localhost.localdomain [127.0.0.1]) by zero.aec.at (8.11.6/8.11.2) with ESMTP id h74CoLm04438 for ; Mon, 4 Aug 2003 14:50:21 +0200 Received: by fred.muc.de (Postfix on SuSE Linux 7.3 (i386), from userid 500) id C18D35BB86; Mon, 4 Aug 2003 14:50:22 +0200 (CEST) Date: Mon, 4 Aug 2003 14:50:22 +0200 From: Andi Kleen To: netdev@oss.sgi.com Subject: [PATCH] Make XFRM optional Message-ID: <20030804125022.GA8167@averell> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4i X-archive-position: 4512 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@muc.de Precedence: bulk X-list: netdev Only compile in the xfrm subsystem when it's needed by any config options. This avoids some code/data structure bloat in case you don't use IP tunneling or IPsec. Also adds a net_ratelimit() to a unprotected printk. For 2.6.0test2 -Andi diff -u linux-work/include/net/dst.h-XFRM linux-work/include/net/dst.h --- linux-work/include/net/dst.h-XFRM 2003-07-18 02:40:02.000000000 +0200 +++ linux-work/include/net/dst.h 2003-08-03 23:12:24.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-work/include/net/xfrm.h-XFRM linux-work/include/net/xfrm.h --- linux-work/include/net/xfrm.h-XFRM 2003-07-28 23:12:30.000000000 +0200 +++ linux-work/include/net/xfrm.h 2003-08-03 23:14:04.000000000 +0200 @@ -587,6 +587,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -652,6 +654,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -782,12 +804,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned short family); extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -809,7 +851,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-work/net/core/skbuff.c-XFRM linux-work/net/core/skbuff.c --- linux-work/net/core/skbuff.c-XFRM 2003-07-18 02:39:47.000000000 +0200 +++ linux-work/net/core/skbuff.c 2003-08-03 23:12:25.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-work/net/ipv4/Kconfig-XFRM linux-work/net/ipv4/Kconfig --- linux-work/net/ipv4/Kconfig-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv4/Makefile-XFRM linux-work/net/ipv4/Makefile --- linux-work/net/ipv4/Makefile-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-work/net/ipv4/route.c-XFRM linux-work/net/ipv4/route.c --- linux-work/net/ipv4/route.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv4/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-work/net/ipv4/udp.c-XFRM linux-work/net/ipv4/udp.c --- linux-work/net/ipv4/udp.c-XFRM 2003-07-18 02:42:43.000000000 +0200 +++ linux-work/net/ipv4/udp.c 2003-08-03 23:31:05.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -4,6 +4,7 @@ config IPV6_PRIVACY bool "IPv6: Privacy Extensions (RFC 3041) support" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_MD5 ---help--- @@ -22,6 +23,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +36,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +50,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- @@ -57,6 +61,7 @@ config IPV6_TUNNEL tristate "IPv6: IPv6-in-IPv6 tunnel" + select XFRM depends on IPV6 ---help--- Support for IPv6-in-IPv6 tunnels described in RFC 2473. diff -u linux-work/net/ipv6/Makefile-XFRM linux-work/net/ipv6/Makefile --- linux-work/net/ipv6/Makefile-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-work/net/ipv6/ipv6_syms.c-XFRM linux-work/net/ipv6/ipv6_syms.c --- linux-work/net/ipv6/ipv6_syms.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv6/ipv6_syms.c 2003-08-03 23:14:41.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-work/net/ipv6/route.c-XFRM linux-work/net/ipv6/route.c --- linux-work/net/ipv6/route.c-XFRM 2003-07-28 23:12:32.000000000 +0200 +++ linux-work/net/ipv6/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-work/net/xfrm/Kconfig-XFRM linux-work/net/xfrm/Kconfig --- linux-work/net/xfrm/Kconfig-XFRM 2003-05-27 03:00:40.000000000 +0200 +++ linux-work/net/xfrm/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. diff -u linux-work/net/xfrm/Makefile-XFRM linux-work/net/xfrm/Makefile --- linux-work/net/xfrm/Makefile-XFRM 2003-05-27 03:01:03.000000000 +0200 +++ linux-work/net/xfrm/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. # -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-work/net/xfrm/xfrm_export.c-XFRM linux-work/net/xfrm/xfrm_export.c --- linux-work/net/xfrm/xfrm_export.c-XFRM 2003-08-03 23:12:25.000000000 +0200 +++ linux-work/net/xfrm/xfrm_export.c 2003-08-03 23:16:06.000000000 +0200 @@ -0,0 +1,76 @@ +#include +#include + +EXPORT_SYMBOL(xfrm_user_policy); +EXPORT_SYMBOL(km_waitq); +EXPORT_SYMBOL(km_new_mapping); +EXPORT_SYMBOL(xfrm_cfg_sem); +EXPORT_SYMBOL(xfrm_policy_alloc); +EXPORT_SYMBOL(__xfrm_policy_destroy); +EXPORT_SYMBOL(xfrm_lookup); +EXPORT_SYMBOL(__xfrm_policy_check); +EXPORT_SYMBOL(__xfrm_route_forward); +EXPORT_SYMBOL(xfrm_state_alloc); +EXPORT_SYMBOL(__xfrm_state_destroy); +EXPORT_SYMBOL(xfrm_state_find); +EXPORT_SYMBOL(xfrm_state_insert); +EXPORT_SYMBOL(xfrm_state_add); +EXPORT_SYMBOL(xfrm_state_update); +EXPORT_SYMBOL(xfrm_state_check_expire); +EXPORT_SYMBOL(xfrm_state_check_space); +EXPORT_SYMBOL(xfrm_state_lookup); +EXPORT_SYMBOL(xfrm_state_register_afinfo); +EXPORT_SYMBOL(xfrm_state_unregister_afinfo); +EXPORT_SYMBOL(xfrm_state_get_afinfo); +EXPORT_SYMBOL(xfrm_state_put_afinfo); +EXPORT_SYMBOL(xfrm_state_delete_tunnel); +EXPORT_SYMBOL(xfrm_replay_check); +EXPORT_SYMBOL(xfrm_replay_advance); +EXPORT_SYMBOL(xfrm_check_selectors); +EXPORT_SYMBOL(xfrm_check_output); +EXPORT_SYMBOL(__secpath_destroy); +EXPORT_SYMBOL(xfrm_get_acqseq); +EXPORT_SYMBOL(xfrm_parse_spi); +EXPORT_SYMBOL(xfrm4_rcv); +EXPORT_SYMBOL(xfrm4_tunnel_register); +EXPORT_SYMBOL(xfrm4_tunnel_deregister); +EXPORT_SYMBOL(xfrm4_tunnel_check_size); +EXPORT_SYMBOL(xfrm_register_type); +EXPORT_SYMBOL(xfrm_unregister_type); +EXPORT_SYMBOL(xfrm_get_type); +EXPORT_SYMBOL(inet_peer_idlock); +EXPORT_SYMBOL(xfrm_register_km); +EXPORT_SYMBOL(xfrm_unregister_km); +EXPORT_SYMBOL(xfrm_state_delete); +EXPORT_SYMBOL(xfrm_state_walk); +EXPORT_SYMBOL(xfrm_find_acq_byseq); +EXPORT_SYMBOL(xfrm_find_acq); +EXPORT_SYMBOL(xfrm_alloc_spi); +EXPORT_SYMBOL(xfrm_state_flush); +EXPORT_SYMBOL(xfrm_policy_kill); +EXPORT_SYMBOL(xfrm_policy_bysel); +EXPORT_SYMBOL(xfrm_policy_insert); +EXPORT_SYMBOL(xfrm_policy_walk); +EXPORT_SYMBOL(xfrm_policy_flush); +EXPORT_SYMBOL(xfrm_policy_byid); +EXPORT_SYMBOL(xfrm_policy_list); +EXPORT_SYMBOL(xfrm_dst_lookup); +EXPORT_SYMBOL(xfrm_policy_register_afinfo); +EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); +EXPORT_SYMBOL(xfrm_policy_get_afinfo); +EXPORT_SYMBOL(xfrm_policy_put_afinfo); + +EXPORT_SYMBOL_GPL(xfrm_probe_algs); +EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); +EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); + +EXPORT_SYMBOL_GPL(skb_icv_walk); diff -u linux-work/net/Kconfig-XFRM linux-work/net/Kconfig --- linux-work/net/Kconfig-XFRM 2003-05-27 03:00:21.000000000 +0200 +++ linux-work/net/Kconfig 2003-08-03 23:12:24.000000000 +0200 @@ -143,6 +143,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. They are required if you are going to use IPsec tools ported diff -u linux-work/net/netsyms.c-XFRM linux-work/net/netsyms.c --- linux-work/net/netsyms.c-XFRM 2003-07-28 23:12:33.000000000 +0200 +++ linux-work/net/netsyms.c 2003-08-03 23:16:23.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -294,78 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); -EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); -EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From yoshfuji@linux-ipv6.org Mon Aug 4 05:58:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 05:58:08 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74Cw1Fl018295 for ; Mon, 4 Aug 2003 05:58:03 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h74Cw11M000463; Mon, 4 Aug 2003 21:58:01 +0900 Date: Mon, 04 Aug 2003 21:58:01 +0900 (JST) Message-Id: <20030804.215801.124854897.yoshfuji@linux-ipv6.org> To: ak@muc.de Cc: netdev@oss.sgi.com, yoshfuji@linux-ipv6.org Subject: Re: [PATCH] Make XFRM optional From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030804125022.GA8167@averell> References: <20030804125022.GA8167@averell> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4513 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Hello. In article <20030804125022.GA8167@averell> (at Mon, 4 Aug 2003 14:50:22 +0200), Andi Kleen says: > diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig > --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 > +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 > @@ -4,6 +4,7 @@ > config IPV6_PRIVACY > bool "IPv6: Privacy Extensions (RFC 3041) support" > depends on IPV6 > + select XFRM > select CRYPTO > select CRYPTO_MD5 > ---help--- We do not need this. > @@ -57,6 +61,7 @@ > > config IPV6_TUNNEL > tristate "IPv6: IPv6-in-IPv6 tunnel" > + select XFRM > depends on IPV6 > ---help--- > Support for IPv6-in-IPv6 tunnels described in RFC 2473. We do not need this for now. -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From ak@muc.de Mon Aug 4 06:04:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 06:04:29 -0700 (PDT) Received: from colin2.muc.de (qmailr@colin2.muc.de [193.149.48.15]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74D4CFl018727 for ; Mon, 4 Aug 2003 06:04:14 -0700 Received: (qmail 39137 invoked by uid 3709); 4 Aug 2003 13:04:08 -0000 Date: 4 Aug 2003 15:04:08 +0200 Date: Mon, 4 Aug 2003 15:04:08 +0200 From: Andi Kleen To: "YOSHIFUJI Hideaki / ?$B5HF#1QL@" Cc: ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-ID: <20030804130408.GA36367@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030804.215801.124854897.yoshfuji@linux-ipv6.org> User-Agent: Mutt/1.4.1i X-archive-position: 4514 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@colin2.muc.de Precedence: bulk X-list: netdev On Mon, Aug 04, 2003 at 09:58:01PM +0900, YOSHIFUJI Hideaki / ?$B5HF#1QL@ wrote: > Hello. > > In article <20030804125022.GA8167@averell> (at Mon, 4 Aug 2003 14:50:22 +0200), Andi Kleen says: > > > diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig > > --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 > > +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 > > @@ -4,6 +4,7 @@ > > config IPV6_PRIVACY > > bool "IPv6: Privacy Extensions (RFC 3041) support" > > depends on IPV6 > > + select XFRM > > select CRYPTO > > select CRYPTO_MD5 > > ---help--- > > We do not need this. Thanks for the feedback. Here is a new patch with the two hunks removed. -Andi diff -u linux-work/include/net/dst.h-XFRM linux-work/include/net/dst.h --- linux-work/include/net/dst.h-XFRM 2003-07-18 02:40:02.000000000 +0200 +++ linux-work/include/net/dst.h 2003-08-03 23:12:24.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-work/include/net/xfrm.h-XFRM linux-work/include/net/xfrm.h --- linux-work/include/net/xfrm.h-XFRM 2003-07-28 23:12:30.000000000 +0200 +++ linux-work/include/net/xfrm.h 2003-08-03 23:14:04.000000000 +0200 @@ -587,6 +587,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -652,6 +654,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -782,12 +804,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned short family); extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -809,7 +851,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-work/net/core/skbuff.c-XFRM linux-work/net/core/skbuff.c --- linux-work/net/core/skbuff.c-XFRM 2003-07-18 02:39:47.000000000 +0200 +++ linux-work/net/core/skbuff.c 2003-08-03 23:12:25.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-work/net/ipv4/Kconfig-XFRM linux-work/net/ipv4/Kconfig --- linux-work/net/ipv4/Kconfig-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv4/Makefile-XFRM linux-work/net/ipv4/Makefile --- linux-work/net/ipv4/Makefile-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-work/net/ipv4/route.c-XFRM linux-work/net/ipv4/route.c --- linux-work/net/ipv4/route.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv4/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-work/net/ipv4/udp.c-XFRM linux-work/net/ipv4/udp.c --- linux-work/net/ipv4/udp.c-XFRM 2003-07-18 02:42:43.000000000 +0200 +++ linux-work/net/ipv4/udp.c 2003-08-03 23:31:05.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -22,6 +23,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +36,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +50,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv6/Makefile-XFRM linux-work/net/ipv6/Makefile --- linux-work/net/ipv6/Makefile-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-work/net/ipv6/ipv6_syms.c-XFRM linux-work/net/ipv6/ipv6_syms.c --- linux-work/net/ipv6/ipv6_syms.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv6/ipv6_syms.c 2003-08-03 23:14:41.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-work/net/ipv6/route.c-XFRM linux-work/net/ipv6/route.c --- linux-work/net/ipv6/route.c-XFRM 2003-07-28 23:12:32.000000000 +0200 +++ linux-work/net/ipv6/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-work/net/xfrm/Kconfig-XFRM linux-work/net/xfrm/Kconfig --- linux-work/net/xfrm/Kconfig-XFRM 2003-05-27 03:00:40.000000000 +0200 +++ linux-work/net/xfrm/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. diff -u linux-work/net/xfrm/Makefile-XFRM linux-work/net/xfrm/Makefile --- linux-work/net/xfrm/Makefile-XFRM 2003-05-27 03:01:03.000000000 +0200 +++ linux-work/net/xfrm/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. # -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-work/net/xfrm/xfrm_export.c-XFRM linux-work/net/xfrm/xfrm_export.c --- linux-work/net/xfrm/xfrm_export.c-XFRM 2003-08-03 23:12:25.000000000 +0200 +++ linux-work/net/xfrm/xfrm_export.c 2003-08-03 23:16:06.000000000 +0200 @@ -0,0 +1,76 @@ +#include +#include + +EXPORT_SYMBOL(xfrm_user_policy); +EXPORT_SYMBOL(km_waitq); +EXPORT_SYMBOL(km_new_mapping); +EXPORT_SYMBOL(xfrm_cfg_sem); +EXPORT_SYMBOL(xfrm_policy_alloc); +EXPORT_SYMBOL(__xfrm_policy_destroy); +EXPORT_SYMBOL(xfrm_lookup); +EXPORT_SYMBOL(__xfrm_policy_check); +EXPORT_SYMBOL(__xfrm_route_forward); +EXPORT_SYMBOL(xfrm_state_alloc); +EXPORT_SYMBOL(__xfrm_state_destroy); +EXPORT_SYMBOL(xfrm_state_find); +EXPORT_SYMBOL(xfrm_state_insert); +EXPORT_SYMBOL(xfrm_state_add); +EXPORT_SYMBOL(xfrm_state_update); +EXPORT_SYMBOL(xfrm_state_check_expire); +EXPORT_SYMBOL(xfrm_state_check_space); +EXPORT_SYMBOL(xfrm_state_lookup); +EXPORT_SYMBOL(xfrm_state_register_afinfo); +EXPORT_SYMBOL(xfrm_state_unregister_afinfo); +EXPORT_SYMBOL(xfrm_state_get_afinfo); +EXPORT_SYMBOL(xfrm_state_put_afinfo); +EXPORT_SYMBOL(xfrm_state_delete_tunnel); +EXPORT_SYMBOL(xfrm_replay_check); +EXPORT_SYMBOL(xfrm_replay_advance); +EXPORT_SYMBOL(xfrm_check_selectors); +EXPORT_SYMBOL(xfrm_check_output); +EXPORT_SYMBOL(__secpath_destroy); +EXPORT_SYMBOL(xfrm_get_acqseq); +EXPORT_SYMBOL(xfrm_parse_spi); +EXPORT_SYMBOL(xfrm4_rcv); +EXPORT_SYMBOL(xfrm4_tunnel_register); +EXPORT_SYMBOL(xfrm4_tunnel_deregister); +EXPORT_SYMBOL(xfrm4_tunnel_check_size); +EXPORT_SYMBOL(xfrm_register_type); +EXPORT_SYMBOL(xfrm_unregister_type); +EXPORT_SYMBOL(xfrm_get_type); +EXPORT_SYMBOL(inet_peer_idlock); +EXPORT_SYMBOL(xfrm_register_km); +EXPORT_SYMBOL(xfrm_unregister_km); +EXPORT_SYMBOL(xfrm_state_delete); +EXPORT_SYMBOL(xfrm_state_walk); +EXPORT_SYMBOL(xfrm_find_acq_byseq); +EXPORT_SYMBOL(xfrm_find_acq); +EXPORT_SYMBOL(xfrm_alloc_spi); +EXPORT_SYMBOL(xfrm_state_flush); +EXPORT_SYMBOL(xfrm_policy_kill); +EXPORT_SYMBOL(xfrm_policy_bysel); +EXPORT_SYMBOL(xfrm_policy_insert); +EXPORT_SYMBOL(xfrm_policy_walk); +EXPORT_SYMBOL(xfrm_policy_flush); +EXPORT_SYMBOL(xfrm_policy_byid); +EXPORT_SYMBOL(xfrm_policy_list); +EXPORT_SYMBOL(xfrm_dst_lookup); +EXPORT_SYMBOL(xfrm_policy_register_afinfo); +EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); +EXPORT_SYMBOL(xfrm_policy_get_afinfo); +EXPORT_SYMBOL(xfrm_policy_put_afinfo); + +EXPORT_SYMBOL_GPL(xfrm_probe_algs); +EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); +EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); + +EXPORT_SYMBOL_GPL(skb_icv_walk); diff -u linux-work/net/Kconfig-XFRM linux-work/net/Kconfig --- linux-work/net/Kconfig-XFRM 2003-05-27 03:00:21.000000000 +0200 +++ linux-work/net/Kconfig 2003-08-03 23:12:24.000000000 +0200 @@ -143,6 +143,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. They are required if you are going to use IPsec tools ported diff -u linux-work/net/netsyms.c-XFRM linux-work/net/netsyms.c --- linux-work/net/netsyms.c-XFRM 2003-07-28 23:12:33.000000000 +0200 +++ linux-work/net/netsyms.c 2003-08-03 23:16:23.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -294,78 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); -EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); -EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From nf@hipac.org Mon Aug 4 06:18:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 06:19:00 -0700 (PDT) Received: from indyio.rz.uni-saarland.de (indyio.rz.uni-saarland.de [134.96.7.3]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74DIhFl019267 for ; Mon, 4 Aug 2003 06:18:45 -0700 Received: from mars.rz.uni-saarland.de (mars.rz.uni-saarland.de [134.96.7.4]) by indyio.rz.uni-saarland.de (8.12.9/8.12.5) with ESMTP id h74DIZqk6640013; Mon, 4 Aug 2003 15:18:35 +0200 (CEST) Received: from e002.stw.stud.uni-saarland.de (e002.stw.stud.uni-saarland.de [134.96.65.17]) by mars.rz.uni-saarland.de (8.9.3p2/8.8.4/8.8.2) with ESMTP id PAA26020101; Mon, 4 Aug 2003 15:18:34 +0200 (CEST) Received: from e226.stw.stud.uni-saarland.de ([134.96.65.241] helo=hipac.org) by e002.stw.stud.uni-saarland.de with esmtp (Exim 3.35 #1 (Debian)) id 19jfEQ-0003Qv-00; Mon, 04 Aug 2003 15:18:34 +0200 Message-ID: <3F2E5CD6.4030500@hipac.org> Date: Mon, 04 Aug 2003 15:17:10 +0200 From: Michael Bellion and Thomas Heinz User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030714 Debian/1.4-2 X-Accept-Language: de, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [RFC] High Performance Packet Classifiction for tc framework References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> In-Reply-To: <1059934468.1103.41.camel@jzny.localdomain> X-Enigmail-Version: 0.76.2.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig1F772C011F16724D016A230F" X-archive-position: 4515 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nf@hipac.org Precedence: bulk X-list: netdev This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig1F772C011F16724D016A230F Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Hi Jamal You wrote: > Apologies for late response. Its funny how i thought i was going to have > more time in the last 2 weeks but due to bad scheduling that wasnt the > case. No problemo ... > I think i will have to look at your code to make comments. Curious about it. > Not entirely accurate. Depends which tc classifier. u32 hash tables are > infact like iptables chains. Hm, we don't think so. Unfortunately, there does not seem to be much information about the inner workings of u32 and we currently don't have the time to deduce the whole algorithm from the code. Here is a short overview of our current view on u32: - each u32 filter "rule" consists of possibly several u32 matches, i.e. tc_u32_sel with nkeys tc_u32_key's => one rule is basically represented as a struct tc_u_knode - a set of u32 filter rules with same priority is in general a tree of hashes like for example: hash1: |--|--| / \ hash2: |--|--|--| hash3: |--|--|--|--| | | | | | | | r1 r2 r3 r4 r5 r6 r7 where the r_i are in fact lists of rules (-> hashing with chaining) => if there is more than one single filter with same prio there is always a tree of hashes (with possibly only 1 node (=hash)) - within such a tree of u32 filters (with same prio) there is no concept of prioritizing them any further => the rules must be conflict free - there is no way of optimizing filters with different priorities since one cannot assume that the intermediate classifiers are all of the same type If this is not the way it _really_ works we'd appreciate it if you could describe the general principles behind u32. > Note, the concept of priorities which is used for conflict resolution as > well as further separating sets of rules doesnt exist in iptables. Well, iptables rule position and tc filter priorities are just the same apart from the fact that iptables does not allow multiple rules to have the same priority (read: position). Therefore iptables rulesets don't suffer from conflicts. > You can also have them use different priorities and with the continue > operator first clasify based on packet data then on metadata or on > another packet header filter. Ok but then you fall back to the linear approach. Since with u32 only blocks of rules with same prio can be optimized one has to implement a ruleset using as few different prioritized blocks of filters as possible to achieve maximum performance. >>One disadvantage of this concept is that the hashed filters >>must be compact, i.e. there cannot be other classifiers in between. > > I didnt understand this. Are you talking about conflict resolving of > overlapping filters? No, the issue is just that within a block of filters with same prio there cannot be another type of filter, e.g. one cannot put a route classifier inside a hash of u32 classifiers. >>Another major disadvantage is caused by the hashing scheme. >>If you use the hash for 1 dimension you have to make sure that >>either all filters in a certain bucket are disjoint or you must have >>an implicit ordering of the rules (according to the insertion order >>or something). This scheme is not extendable to 2 or more dimensions, >>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so >>on, because you simply cannot express arbitrary rulesets. > > If i understood you - you are refering to a way to reduce the number of > lookups by having disjoint hashes. I suppose for something as simple as > a five tuple lookup, this is almost solvable by hardcoding the different > fields into multiway hashes. Its when you try to generalize that it > becomes an issue. What do you mean exactly by "five tuple"? Do you refer to rules which consist of 5 punctiform matches, i.e. no masks or ranges or wildcards allowed, like (src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123, dst port 456)? Of course the scheme works for such cases (which consist of non-conflicting rules) although the user must be aware of the concrete hash function: divisor & u32_hash_fold(key, sel) because the mask would be 0xffffffff for the ip's. If ranges or overlapping masks are involved it gets really complicated and we doubt that people are able to manage such scenarios. >>Another general problem is of course that the user has to manually >>setup the hash which is rather inconvenient. > > Yes. Take a look at Werners tcng - he has a clever way to hide things > from the user. I did experimentation on u32 with a kernel thread which > rearranged things when they seemed out of balance but i havent > experimented with a lot of rules. We had a look at the tcng paper. Here it says that the u32 classifier is not used in the optimal way. Since we didn't have a look at the current tcng release it might well be that these problems are already addressed. Is that the case? BTW, why do you want to rearrange the tree of hashes and based on which heuristics? Why is there a kernel thread needed? Isn't it possible to arrange the tree directly after insert/delete operations? >>Now, what are the implications on the matching performance: >>tc vs. nf-hipac? As long as the extended hashing stuff is not used >>nf-hipac is clearly superior to tc. > > You are refering to u32. You mean as long as u32 stored things in a > single linked list, you win - correct? Yes, but this is not only true for u32. As long as the ruleset looks like: "n filters with n different priorities which can be translated into n nf-hipac rules" nf-hipac is clearly faster because in this case tc uses the linear approach. >>When hashing is used it _really_ >>depends. If there is only one classifier (with hashing) per interface >>and the number of rules per bucket is very small the performance should >>be comparable. As soon as you add other classifiers nf-hipac will >>outperform tc again. > > If we take a simple user interface abstraction like tcng which hides the > evil of u32 and then take simple 5 tuple rules - i doubt you will see > any difference. For more generic setup, the kernel thread i refer to > would work - but may slow insertion. For the simple punctiform examples like described above you may be right that nf-hipac and tc should perform similar but it's not clear to us how you want to achieve universality (including mask, ranges and wildcards) by this kernel thread rearranging approach. Basically you have to address the following problem: Given an arbitrary set of u32 rules with different priorities you have to compute an semantically equivalent representation with a tree of hashes. >>So, basically HIPAC is just a normal classifier like any other >>with two exceptions: >> a) it can occur only once per interface >> b) the rules within the classifier can contain other classifiers, >> e.g. u32, fw, tc_index, as matches > > But why restriction a)? Well, the restriction is necessary because of the new hipac design in which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc) are just applications for the classification framework. The basic idea is to reduce the number of classifications on the forwarding path to a single one (in the best case). In order to truly understand the requirement it would be necessary to explain the idea behind the new stage concept which is beyond the scope of this e-mail :-/. > Also why should we need hipac to hold other filters when the > infrastructure itself can hold the extended filters just fine? > I think you may actually be trying to say why somewhere in the email, > but it must not be making a significant impression on my brain. The idea is to reduce the embedded classifiers to a match, i.e. their return value is ignored. This offers the possibility of expressing a conjunction of native matches and classifiers in the very same way nf-hipac rules support iptables matches. This enhances the expressiveness of classification rules. A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m| matches if nat. match 1-n and emb. cls 1-m match. >>There is just one problem with the current tc framework. Once >>a new filter is inserted into the chain it is not removed even >>if the change function of the classifier returns < 0 >>(2.6.0-test1: net/sched/cls_api.c: line 280f). >>This should be changed anyway, shouldn't it? > > Are you refering to this piece of code?: > ---- > err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh); > if (err == 0) > tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER); > > errout: > if (cl) > cops->put(q, cl); > return err; > --- Yes. > change() should not return <0 if it has installed the filter i think. > Should the top level code be responsible for removing filters? The top level code (cls_hipac.c:tc_ctl_filter) is responsible for creating new tcf_proto structs (if not existent) and enqueuing the struct into the chain. Therefore it is also responsible for taking the stuff out of the chain again if necessary. In case we have just created a new tcf_proto and change fails it would be better if the new tcf_proto is removed afterwards, i.e. write_lock(&qdisc_tree_lock); spin_lock_bh(&dev->queue_lock); *back = tp->next; spin_unlock_bh(&dev->queue_lock); write_unlock(&qdisc_tree_lock); tp->ops->destroy(tp); module_put(tp->ops->owner); kfree(tp); is issued. Do you agree? > Consider what i said above. I'll try n cobble together some examples to > demonstrate (although it seems you already know this). > To allow for anyone to install classifiers-du-jour without being > dependet on hipac would be very useful. So ideas that you have for > enabling this cleanly should be moved to cls_api. Nobody will be forced to use hipac :-). It's just another classifier like u32. We don't even had to modify cls_api so far. Everything integrates just fine. Regards, +-----------------------+----------------------+ | Michael Bellion | Thomas Heinz | | | | +-----------------------+----------------------+ | High Performance Packet Classification | | nf-hipac: http://www.hipac.org/ | +----------------------------------------------+ --------------enig1F772C011F16724D016A230F Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Debian - http://enigmail.mozdev.org iD8DBQE/LlzdtXh2AYIMjggRAtcvAKCUZykozfMnI5MmRMo0j/zH6TDg7gCdGl20 ngF9kmhPF45vfAYjTq6sd/U= =qy5Z -----END PGP SIGNATURE----- --------------enig1F772C011F16724D016A230F-- From niv@us.ibm.com Mon Aug 4 08:51:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 08:51:07 -0700 (PDT) Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.131]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74FotFl021660 for ; Mon, 4 Aug 2003 08:51:02 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e33.co.us.ibm.com (8.12.9/8.12.2) with ESMTP id h74FoHj3303868; Mon, 4 Aug 2003 11:50:17 -0400 Received: from us.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h74FoGiQ067578; Mon, 4 Aug 2003 09:50:17 -0600 Message-ID: <3F2E80CD.3090206@us.ibm.com> Date: Mon, 04 Aug 2003 08:50:37 -0700 From: Nivedita Singhvi User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2.1) Gecko/20021130 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Andi Kleen CC: netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional References: <20030804125022.GA8167@averell> In-Reply-To: <20030804125022.GA8167@averell> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4516 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: niv@us.ibm.com Precedence: bulk X-list: netdev Andi Kleen wrote: > Only compile in the xfrm subsystem when it's needed by any config options. > > This avoids some code/data structure bloat in case you don't use IP > tunneling or IPsec. Yes, I would like this too, please. thanks, Nivedita From hadi@cyberus.ca Mon Aug 4 08:51:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 08:51:53 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74FpgFl021732 for ; Mon, 4 Aug 2003 08:51:43 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jhcb-000EBD-00; Mon, 04 Aug 2003 11:51:41 -0400 Subject: Re: [RFC] High Performance Packet Classifiction for tc framework From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion and Thomas Heinz Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3F2E5CD6.4030500@hipac.org> References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060012260.1103.380.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 11:51:01 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4517 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Olla, On Mon, 2003-08-04 at 09:17, Michael Bellion and Thomas Heinz wrote: > > I think i will have to look at your code to make comments. > > Curious about it. > I promise i will. I dont think i will do it justice spending 5 minutes on it. I take it you have written extensive docs too ;-> > > Not entirely accurate. Depends which tc classifier. u32 hash tables are > > infact like iptables chains. > > Hm, we don't think so. Unfortunately, there does not seem to be much > information about the inner workings of u32 and we currently don't have > the time to deduce the whole algorithm from the code. > Unfortunately it is more exciting to write code than documents. I almost got someone to document at least its proper usage but they backed away at the last minute. > Here is a short overview of our current view on u32: > - each u32 filter "rule" consists of possibly several u32 matches, > i.e. tc_u32_sel with nkeys tc_u32_key's > => one rule is basically represented as a struct tc_u_knode > - a set of u32 filter rules with same priority is in general a > tree of hashes like for example: > hash1: |--|--| > / \ > hash2: |--|--|--| hash3: |--|--|--|--| > | | | | | | | > r1 r2 r3 r4 r5 r6 r7 > where the r_i are in fact lists of rules (-> hashing with > chaining) > => if there is more than one single filter with same prio > there is always a tree of hashes (with possibly only 1 node > (=hash)) > - within such a tree of u32 filters (with same prio) there is > no concept of prioritizing them any further => the rules must > be conflict free > - there is no way of optimizing filters with different priorities > since one cannot assume that the intermediate classifiers are all > of the same type > > If this is not the way it _really_ works we'd appreciate it if you could > describe the general principles behind u32. > u32 is a swiss knife so to go into general principles requires some time, motivation, and more importantly patience.I possess none of these nice attributes at the moment. You are doing a good job keep reading the code. I dont wanna go in a lot of details, but one important detail is that keynodes can also lead to other hash tables. So you can split the packet parsing across multiple hashes - this is where the comparison with chains comes in. There are several ways to do this. I'll show you the brute force way and you can make it more usable with "hashkey" and "sample" operator. Stealing from your example: hash1: |--|--| / hash2: |--|--|--| | | | r1 r2 r3 | | hash3 hash4 | | r4 r5 Example, you go into hash2 for all IP packets. The rules on the hash2 look at the protocol type and select a different hash table for TCP, UDP, ICMP etc. - so general rules is: Put your most hit rules at the highest priority so they are found first. Heres an example, i havent tested this (i can send you a tested example if you cant get this to work): ------- TCF=tc filter add dev eth0 parent ffff: protocol ip prio 10 # add hash table 1 $TCF handle 1::: u32 divisor 1 #add hash table 2 $TCF handle 2::: u32 divisor 1 #add your filter rules to specific tables: ICMP to table 1, TCP to table #6 etc . . #ICMP gets matched in table 1 $TCF match ip protocol 1 0xff link 1:0:0 . . ---------- Makes sense? Note, this doesnt say much about the user usability of u32 - it just says can be done. > > Note, the concept of priorities which is used for conflict resolution as > > well as further separating sets of rules doesnt exist in iptables. > > Well, iptables rule position and tc filter priorities are just the > same apart from the fact that iptables does not allow multiple rules > to have the same priority (read: position). Therefore iptables rulesets > don't suffer from conflicts. > sure position could be used as a priority. It is easier/intuitive to just have explicit priorities. > > You can also have them use different priorities and with the continue > > operator first clasify based on packet data then on metadata or on > > another packet header filter. > > Ok but then you fall back to the linear approach. Since with u32 only > blocks of rules with same prio can be optimized one has to implement a > ruleset using as few different prioritized blocks of filters as possible > to achieve maximum performance. > Read what i said above if you still hold the same opinion lets discuss. What "optimizes" could be a user interface or the thread i was talking about earlier. > >>One disadvantage of this concept is that the hashed filters > >>must be compact, i.e. there cannot be other classifiers in between. > > > > I didnt understand this. Are you talking about conflict resolving of > > overlapping filters? > > No, the issue is just that within a block of filters with same prio > there cannot be another type of filter, e.g. one cannot put a route > classifier inside a hash of u32 classifiers. > But you dont need to as i was pointing out earlier. You can have both fwmark,tcindex,u32, rsvp etc being invoked one after the other. > >>Another major disadvantage is caused by the hashing scheme. > >>If you use the hash for 1 dimension you have to make sure that > >>either all filters in a certain bucket are disjoint or you must have > >>an implicit ordering of the rules (according to the insertion order > >>or something). This scheme is not extendable to 2 or more dimensions, > >>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so > >>on, because you simply cannot express arbitrary rulesets. > > > > If i understood you - you are refering to a way to reduce the number of > > lookups by having disjoint hashes. I suppose for something as simple as > > a five tuple lookup, this is almost solvable by hardcoding the different > > fields into multiway hashes. Its when you try to generalize that it > > becomes an issue. > > What do you mean exactly by "five tuple"? Do you refer to rules which > consist of 5 punctiform matches, i.e. no masks or ranges or wildcards > allowed, like (src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123, > dst port 456)? > above but with masks. "5 tuple" is a classical name for the above. > Of course the scheme works for such cases (which consist of > non-conflicting rules) although the user must be aware of the > concrete hash function: divisor & u32_hash_fold(key, sel) > because the mask would be 0xffffffff for the ip's. > > If ranges or overlapping masks are involved it gets really complicated > and we doubt that people are able to manage such scenarios. > I was refering to the cascaded hash tables i was refering to earlier. Depending on the rules, you could optimize differently. > >>Another general problem is of course that the user has to manually > >>setup the hash which is rather inconvenient. > > > > Yes. Take a look at Werners tcng - he has a clever way to hide things > > from the user. I did experimentation on u32 with a kernel thread which > > rearranged things when they seemed out of balance but i havent > > experimented with a lot of rules. > > We had a look at the tcng paper. Here it says that the u32 classifier > is not used in the optimal way. Since we didn't have a look at the > current tcng release it might well be that these problems are already > addressed. Is that the case? > He doesnt fix the u32, rather if you use his wrappers he outputs optimized u32 rules. All that is hidden from the user. > BTW, why do you want to rearrange the tree of hashes and based on which > heuristics? Why is there a kernel thread needed? Isn't it possible to > arrange the tree directly after insert/delete operations? > You can do that, but then you are adding delay to the insertion/deletion rates which are very important metrics. Another way to do it is to fire a netlink message every time a hash table's keynodes exceed a threshold value and have user space compute a rearrangement. Essentially you have to weigh your tradeoffs. > >>Now, what are the implications on the matching performance: > >>tc vs. nf-hipac? As long as the extended hashing stuff is not used > >>nf-hipac is clearly superior to tc. > > > > You are refering to u32. You mean as long as u32 stored things in a > > single linked list, you win - correct? > > Yes, but this is not only true for u32. As long as the ruleset > looks like: "n filters with n different priorities which can > be translated into n nf-hipac rules" nf-hipac is clearly faster > because in this case tc uses the linear approach. > If you still hold this opinion after my explanation on cascaded hash tables, then lets discuss again. > >>When hashing is used it _really_ > >>depends. If there is only one classifier (with hashing) per interface > >>and the number of rules per bucket is very small the performance should > >>be comparable. As soon as you add other classifiers nf-hipac will > >>outperform tc again. > > > > If we take a simple user interface abstraction like tcng which hides the > > evil of u32 and then take simple 5 tuple rules - i doubt you will see > > any difference. For more generic setup, the kernel thread i refer to > > would work - but may slow insertion. > > For the simple punctiform examples like described above you may be right > that nf-hipac and tc should perform similar but it's not clear to us > how you want to achieve universality (including mask, ranges and > wildcards) by this kernel thread rearranging approach. Basically you > have to address the following problem: Given an arbitrary set of u32 > rules with different priorities you have to compute an semantically > equivalent representation with a tree of hashes. > yes - that is the challenge to resolve;-> > >>So, basically HIPAC is just a normal classifier like any other > >>with two exceptions: > >> a) it can occur only once per interface > >> b) the rules within the classifier can contain other classifiers, > >> e.g. u32, fw, tc_index, as matches > > > > But why restriction a)? > > Well, the restriction is necessary because of the new hipac design in > which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc) are > just applications for the classification framework. The basic idea is > to reduce the number of classifications on the forwarding path to a > single one (in the best case). In order to truly understand the > requirement it would be necessary to explain the idea behind the new > stage concept which is beyond the scope of this e-mail :-/. > Ok - maybe when you explain the concept later i will get it. Is your plan to put this in other places other than Linux? > > Also why should we need hipac to hold other filters when the > > infrastructure itself can hold the extended filters just fine? > > I think you may actually be trying to say why somewhere in the email, > > but it must not be making a significant impression on my brain. > > The idea is to reduce the embedded classifiers to a match, i.e. > their return value is ignored. This offers the possibility of > expressing a conjunction of native matches and classifiers in the > very same way nf-hipac rules support iptables matches. This enhances > the expressiveness of classification rules. > A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m| > matches if nat. match 1-n and emb. cls 1-m match. > So you got this thought from iptables and took it to the next level? I am still not sure i understand why not use what already exists - but i'll just say i dont see it right now. > > The top level code (cls_hipac.c:tc_ctl_filter) is responsible for > creating new tcf_proto structs (if not existent) and enqueuing the > struct into the chain. Therefore it is also responsible for taking > the stuff out of the chain again if necessary. In case we have just > created a new tcf_proto and change fails it would be better if the new > tcf_proto is removed afterwards, i.e. > write_lock(&qdisc_tree_lock); > spin_lock_bh(&dev->queue_lock); > *back = tp->next; > spin_unlock_bh(&dev->queue_lock); > write_unlock(&qdisc_tree_lock); > tp->ops->destroy(tp); > module_put(tp->ops->owner); > kfree(tp); > is issued. > Do you agree? > It doesnt appear harmful to leave it there without destroying it. The next time someome adds a filter of the same protocol + priority, it will already exist. If you want to be accurate (because it does get destroyed when the init() fails), then destroy it but you need to put checks for "incase we have added a new tcf_proto" which may not look pretty. Is this causing you some discomfort? > > Consider what i said above. I'll try n cobble together some examples to > > demonstrate (although it seems you already know this). > > To allow for anyone to install classifiers-du-jour without being > > dependet on hipac would be very useful. So ideas that you have for > > enabling this cleanly should be moved to cls_api. > > Nobody will be forced to use hipac :-). It's just another classifier > like u32. We don't even had to modify cls_api so far. Everything > integrates just fine. > cool. Keep up the good work. cheers, jamal From mathis@psc.edu Mon Aug 4 09:21:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 09:21:12 -0700 (PDT) Received: from zippy.psc.edu (pa-monroeville3a-31.pit.adelphia.net [24.53.185.31]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74GL2Fl023294 for ; Mon, 4 Aug 2003 09:21:03 -0700 Received: from localhost (mathis@localhost) by zippy.psc.edu (8.11.6/8.11.6) with ESMTP id h74GKlB27764; Mon, 4 Aug 2003 12:20:47 -0400 X-Authentication-Warning: zippy.psc.edu: mathis owned process doing -bs Date: Mon, 4 Aug 2003 12:20:47 -0400 (EDT) From: Matt Mathis To: "David S. Miller" cc: netdev@oss.sgi.com, John Heffner Subject: Web100 In-Reply-To: <20030803222554.7027a160.davem@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4518 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mathis@psc.edu Precedence: bulk X-list: netdev On Sun, 3 Aug 2003, David S. Miller wrote: > The web100 patches aren't in the kernel because 1) they've > never even been submitted and 2) they need a large cleanup. Furthermore 1 is due to 2.... We know our code is not ready for kernel inclusion, and are having a little trouble seeing the path through to doing so ourselves. A big part of the problem is that I an not a kernel guy - my focus in on the protocol and measurement issues and not on the implementation details. Although John could probably get it together by himself, he is split between several projects and it isn't clear that incrementally submitting substandard patches is a cost effective strategy to getting it done. It would be a lot easier if we 1) had a mentor who was experienced at kernel inclusion, 2) specific guidance on some of the non-network components, such as the API (currently using /proc) and 3) a laundry list of things that we need to fix. > I sort of get the impression that the web100 folks actually like that > their changes are not in the main sources, it keeps their work > "special". Nope, not at all. Actually I find kernel inclusion rather daunting. One of our collaborators was asked some very pointed questions about the TCP ESTATS MIB by somebody at M$. I would hate to have the first general release be in anything but Linux. Any takers on helping us? Thanks, --MM-- ------------------------------------------- Matt Mathis http://www.psc.edu/~mathis Work:412.268.3319 Home/Cell:412.654.7529 ------------------------------------------- From hadi@cyberus.ca Mon Aug 4 09:46:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 09:46:08 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74GjxFl024775 for ; Mon, 4 Aug 2003 09:46:00 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jiT8-000JAa-00; Mon, 04 Aug 2003 12:45:59 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: netdev@oss.sgi.com Cc: "Ihar 'Philips' Filipau" Content-Type: text/plain Organization: jamalopolis Message-Id: <1060015518.1103.399.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 12:45:18 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4519 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Can you please post to netdev? Posting networking related issues to linux kernel alone is considered rude. Posting them to netdev only is acceptable. > Ihar 'Philips' Filipau wrote: > > >Werner Almesberger wrote: > > Ihar 'Philips' Filipau wrote: > > > >| | | Modern NPUs generally do this. > > > > > > Unfortunately, they don't - they run *some* code, but that > > is rarely a Linux kernel, or a substantial part of it. > > > > Embedded CPU we are using is based MIPS, and has a lot of specialized > instructions. > It makes not that much sense to run kernel (especially Linux) on CPU > which is optimized for handling of network packets. (And has actually > several co-processors to help in this task). The coprocessors are useful, but that has nothing to do with the value of the NPU. You can add those within a general processor system. I am also in the camp that to be really useful these things need to run a real OS - Linux. > How much sense it makes to run general purpose OS (optimized for PCs > and servers) on devices which can make only couple of functions? (and no > MMU btw) > > It is a whole idea behind this kind of CPUs - to do a few of > functions - but to do them good. > > If you will start stretching CPUs like this to fit Linux kernel - it > will generally just increase price. Probably there are some markets > which can afford this. > Actually i believe it will lower the prices.I am waiting for intel to get hyperthreading right - then we'll see these things disapear. The only thing useful about NPUs is their ability to management the discrepency between memory latency and CPU speeds. Trust me i used to be in the same camp as you.If you note, a lot of these things appeared around the height of the .com days. VCs were looking for something new and exciting. > Remeber - "Small is beatiful" (c) - and linux kernel far from it. > Our routing code which handles two GE interfaces (actually not pure > GE, but up to 2.5GB) fits into 3k. 3k of code - and that's it. not 650kb > of bzip compressed bloat. And it handles two interfaces, handles fast > data path from siblign interfaces, handles up to 1E6 routes. 3k of code. > not 650k of bzip. If all you wanted was to do L3 - why not just buy a $5 chip that can do this for a lot more interfaces? Why sweat over optimizing L3 routing in a 3K space? to nit: Its no longer about routing or bridging, friend. Thats like getting fries at mcdonalds. cheers, jamal From alan@storlinksemi.com Mon Aug 4 10:19:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 10:19:49 -0700 (PDT) Received: from smtp016.mail.yahoo.com (smtp016.mail.yahoo.com [216.136.174.113]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74HJdFl025593 for ; Mon, 4 Aug 2003 10:19:39 -0700 Received: from adsl-63-203-236-74.dsl.snfc21.pacbell.net (HELO AlanLap) (alansuntzishih@63.203.236.74 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 4 Aug 2003 17:19:38 -0000 From: "Alan Shih" To: "Ingo Oeser" , "Jeff Garzik" Cc: "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Mon, 4 Aug 2003 10:19:21 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-Reply-To: <20030804163606.Q639@nightmaster.csn.tu-chemnitz.de> Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 X-archive-position: 4520 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev Hmm, So would main processor still need a copy of the data for re-transmission? Won't that defeat the purpose? Alan -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ingo Oeser Sent: Monday, August 04, 2003 7:36 AM To: Jeff Garzik Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Hi Jeff, On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. Exactly what I suggested: sys_ioroute() "Providing generic pipelines and io routing as Linux service" Msg-ID: <20030718134235.K639@nightmaster.csn.tu-chemnitz.de> on linux-kernel and linux-fsdevel Be my guest. I know, that you mean doing it in hardware, but you cannot accelerate sth. which the kernel doesn't do ;-) Regards Ingo Oeser - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From inaky.perez-gonzalez@intel.com Mon Aug 4 11:36:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:36:32 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74IaNFl027540 for ; Mon, 4 Aug 2003 11:36:24 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h74IUIQ07776 for ; Mon, 4 Aug 2003 18:30:18 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h74Hxgl00580 for ; Mon, 4 Aug 2003 17:59:42 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080411361613752 ; Mon, 04 Aug 2003 11:36:16 -0700 Received: from orsmsx409.amr.corp.intel.com ([192.168.65.58]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 4 Aug 2003 11:36:16 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: TOE brain dump Date: Mon, 4 Aug 2003 11:36:15 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: TOE brain dump Thread-Index: AcNZ/oaUAcgg0owhSWG6t+TBx1MScAAbhUUA From: "Perez-Gonzalez, Inaky" To: "Larry McVoy" , "David Lang" Cc: "Erik Andersen" , "Werner Almesberger" , "Jeff Garzik" , , , "Nivedita Singhvi" X-OriginalArrivalTime: 04 Aug 2003 18:36:16.0508 (UTC) FILETIME=[4A1793C0:01C35AB7] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h74IaNFl027540 X-archive-position: 4521 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: inaky.perez-gonzalez@intel.com Precedence: bulk X-list: netdev > From: Larry McVoy [mailto:lm@bitmover.com] > > > 2. router nodes that have access to main memory (PCI card running linux > > acting as a router/firewall/VPN to offload the main CPU's) > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? Because you want to stack 200 of those together in a huge data center interconnecting whatever you want to interconnect and you don't want your maintenance costs to go up to the sky? I see your point, though :) Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault) From filia@softhome.net Mon Aug 4 11:47:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:47:53 -0700 (PDT) Received: from jive.SoftHome.net (jive.SoftHome.net [66.54.152.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74IliFl028072 for ; Mon, 4 Aug 2003 11:47:45 -0700 Received: (qmail 8633 invoked by uid 417); 4 Aug 2003 18:47:44 -0000 Received: from shunt-smtp-out-0 (HELO softhome.net) (172.16.3.12) by shunt-smtp-out-0 with SMTP; 4 Aug 2003 18:47:44 -0000 Received: from softhome.net ([212.18.200.6]) (AUTH: PLAIN filia@softhome.net) by softhome.net with esmtp; Mon, 04 Aug 2003 12:47:42 -0600 Message-ID: <3F2EAA78.60202@softhome.net> Date: Mon, 04 Aug 2003 20:48:24 +0200 From: "Ihar 'Philips' Filipau" Organization: Home Sweet Home User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030701 X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: netdev@oss.sgi.com Subject: Re: TOE brain dump References: <1060015518.1103.399.camel@jzny.localdomain> In-Reply-To: <1060015518.1103.399.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4522 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: filia@softhome.net Precedence: bulk X-list: netdev jamal wrote: > to nit: Its no longer about routing or bridging, friend. Thats like getting > fries at mcdonalds. > 1GE/10GE - for $5? I'm first in the shoping queue!!!-))) Since I see no reasonable out-come of this discussion I left it. TOE as I see - since my company utilizes several of them - are too different and too specialized to application/protocols. And yes - price of development/deployment maters too. Linux support for those protocols is inmature. It cannot handle or requirements even software-wise. I'm not talking about timing requirements - linux network in general is not (even soft) real-time. My personal flame-meter is out of scale ;-) I shall join the discussion back when I will see any real ideas. > If all you wanted was to do L3 - why not just buy a $5 chip that > can do this for a lot more interfaces? Why sweat over > optimizing L3 routing in a 3K space? We are doing not a teapot, and high level spec for this code takes around 15 pages. 3k - it is not optimized - we have limit around 2GB ;-) It just takes only 3k. And it handles some special (read - proprietary) functions too - some bugs of some other pieces of hardware. NPU does all stuff by itself, but sometimes we need to extract configuration information which is direct to us, for example. From davem@redhat.com Mon Aug 4 11:49:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:49:39 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74InYFl028410 for ; Mon, 4 Aug 2003 11:49:35 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id LAA20777; Mon, 4 Aug 2003 11:45:07 -0700 Date: Mon, 4 Aug 2003 11:45:07 -0700 From: "David S. Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804114507.6e496c77.davem@redhat.com> In-Reply-To: <20030804130408.GA36367@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4523 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 15:04:08 +0200 Andi Kleen wrote: > Thanks for the feedback. Here is a new patch with the two hunks > removed. Still broken in two areas: 1) You moved inet_peer_idlock into net/xfrm/xfrm_exports.c, that looks quite wrong. 2) Your patch doesn't apply to Linus's current tree because "secpath_dup" got added to net/netsyms.c since 2.6.0-test2 got released. I wanted to merge this, but I can't until you fix the above problems. Thanks. From alan@lxorguk.ukuu.org.uk Mon Aug 4 12:07:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:07:30 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74J7OFl029104 for ; Mon, 4 Aug 2003 12:07:26 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h74J3EC3001142; Mon, 4 Aug 2003 20:03:15 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h74J3BPF001140; Mon, 4 Aug 2003 20:03:11 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: RE: TOE brain dump From: Alan Cox To: "Perez-Gonzalez, Inaky" Cc: Larry McVoy , David Lang , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, Linux Kernel Mailing List , Nivedita Singhvi In-Reply-To: References: Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1060023790.723.23.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 04 Aug 2003 20:03:11 +0100 X-archive-position: 4524 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Llu, 2003-08-04 at 19:36, Perez-Gonzalez, Inaky wrote: > > Why would I want to spend money on some silly offload card when I can get > > the whole PC for less than the card? > > Because you want to stack 200 of those together in a huge > data center interconnecting whatever you want to interconnect > and you don't want your maintenance costs to go up to the sky? 17cm squared, fanless, network booting. Its not as big a cost as you might think, and TOE cards fail too, the difference being that if they are now out of production you have a nasty mess on your hands. From werner@almesberger.net Mon Aug 4 12:24:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:24:54 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JOiFl029623 for ; Mon, 4 Aug 2003 12:24:45 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h74JOdG11924; Mon, 4 Aug 2003 12:24:39 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h74JOXM16998; Mon, 4 Aug 2003 16:24:33 -0300 Date: Mon, 4 Aug 2003 16:24:33 -0300 From: Werner Almesberger To: "Eric W. Biederman" Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030804162433.L5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from ebiederm@xmission.com on Sun, Aug 03, 2003 at 01:21:09PM -0600 X-archive-position: 4525 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Eric W. Biederman wrote: > The optimized for low latency cases seem to have a strong > market in clusters. Clusters have captive, no, _desperate_ customers ;-) And it seems that people are just as happy putting MPI as their transport on top of all those link-layer technologies. > There is one place in low latency communications that I can think > of where TCP/IP is not the proper solution. For low latency > communication the checksum is at the wrong end of the packet. That's one of the few things ATM's AAL5 got right. But in the end, I think it doesn't really matter. At 1 Gbps, an MTU-sized packet flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, you may well treat it as an atomic unit. > On that score it is worth noting that the next generation of > peripheral busses (Hypertransport, PCI Express, etc) are all switched. And it's about time for that :-) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From davem@redhat.com Mon Aug 4 12:31:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:31:16 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JVBFl030047 for ; Mon, 4 Aug 2003 12:31:11 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id MAA20942; Mon, 4 Aug 2003 12:26:32 -0700 Date: Mon, 4 Aug 2003 12:26:32 -0700 From: "David S. Miller" To: Werner Almesberger Cc: ebiederm@xmission.com, jgarzik@pobox.com, niv@us.ibm.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-Id: <20030804122632.65ba2122.davem@redhat.com> In-Reply-To: <20030804162433.L5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4526 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 16:24:33 -0300 Werner Almesberger wrote: > Eric W. Biederman wrote: > > There is one place in low latency communications that I can think > > of where TCP/IP is not the proper solution. For low latency > > communication the checksum is at the wrong end of the packet. > > That's one of the few things ATM's AAL5 got right. Let's recall how long the IFF_TRAILERS hack from BSD :-) > But in the end, I think it doesn't really matter. I tend to agree on this one. And on the transmit side if you have more than 1 pending TX frame, you can always be prefetching the next one into the fifo so that by the time the medium is ready all the checksum bits have been done. In fact I'd be surprised if current generation 1g/10g cards are not doing something like this. From hadi@cyberus.ca Mon Aug 4 12:43:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:43:17 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JhAFl030573 for ; Mon, 4 Aug 2003 12:43:11 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jlEc-0008eK-00; Mon, 04 Aug 2003 15:43:10 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: "Ihar 'Philips' Filipau" Cc: netdev@oss.sgi.com In-Reply-To: <3F2EAA78.60202@softhome.net> References: <1060015518.1103.399.camel@jzny.localdomain> <3F2EAA78.60202@softhome.net> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060026149.1102.411.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 15:42:29 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4527 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2003-08-04 at 14:48, Ihar 'Philips' Filipau wrote: > jamal wrote: > > to nit: Its no longer about routing or bridging, friend. Thats like getting > > fries at mcdonalds. > > > > 1GE/10GE - for $5? > I'm first in the shoping queue!!!-))) > I thought you were talking about a 2 Gige interface doing routing, no? Do the math: Dell will happily sell you a (managed?) switch which has 8Giges on it for about $300. It does wire rate on all 8 interfaces. All ready to go in a 1U form factor. How much do you think that chip costs? Lets say it doesnt do L3, how much more do you think it will cost to do L3 in quantities? > Since I see no reasonable out-come of this discussion I left it. > > TOE as I see - since my company utilizes several of them - are too > different and too specialized to application/protocols. And yes - price > of development/deployment maters too. Linux support for those protocols > is inmature. It cannot handle or requirements even software-wise. I'm > not talking about timing requirements - linux network in general is not > (even soft) real-time. > Now this is anti-social talk;-> Why do you need to have realtime for any of this stuff? > My personal flame-meter is out of scale ;-) > I shall join the discussion back when I will see any real ideas. > Please dont dissapear, a lot of questions need answers;-> > > > If all you wanted was to do L3 - why not just buy a $5 chip that > > can do this for a lot more interfaces? Why sweat over > > optimizing L3 routing in a 3K space? > > We are doing not a teapot, and high level spec for this code takes > around 15 pages. > 3k - it is not optimized - we have limit around 2GB ;-) I am really confused now. We must be talking about different class of devices. NPUs as i know them are very limited in how much code you can stash them. In the 10K ranges is already overkill. Do you have any URL i can look at on what you are describing? > It just takes only 3k. And it handles some special (read - > proprietary) functions too - some bugs of some other pieces of hardware. > NPU does all stuff by itself, but sometimes we need to extract > configuration information which is direct to us, for example. Please provide me a pointer if you can - I am very interested in the 2G code space you mention. cheers, jamal > From filia@softhome.net Mon Aug 4 13:05:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 13:05:57 -0700 (PDT) Received: from jive.SoftHome.net (jive.SoftHome.net [66.54.152.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74K5qFl031227 for ; Mon, 4 Aug 2003 13:05:53 -0700 Received: (qmail 22843 invoked by uid 417); 4 Aug 2003 20:05:52 -0000 Received: from shunt-smtp-out-0 (HELO softhome.net) (172.16.3.12) by shunt-smtp-out-0 with SMTP; 4 Aug 2003 20:05:52 -0000 Received: from softhome.net ([212.18.200.6]) (AUTH: PLAIN filia@softhome.net) by softhome.net with esmtp; Mon, 04 Aug 2003 14:05:51 -0600 Message-ID: <3F2EBCCA.5060708@softhome.net> Date: Mon, 04 Aug 2003 22:06:34 +0200 From: "Ihar 'Philips' Filipau" Organization: Home Sweet Home User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030701 X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: netdev@oss.sgi.com Subject: Re: TOE brain dump References: <1060015518.1103.399.camel@jzny.localdomain> <3F2EAA78.60202@softhome.net> <1060026149.1102.411.camel@jzny.localdomain> In-Reply-To: <1060026149.1102.411.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4528 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: filia@softhome.net Precedence: bulk X-list: netdev jamal wrote: > >> It just takes only 3k. And it handles some special (read - >>proprietary) functions too - some bugs of some other pieces of hardware. >>NPU does all stuff by itself, but sometimes we need to extract >>configuration information which is direct to us, for example. > > > Please provide me a pointer if you can - I am very interested in the 2G > code space you mention. > I'm not sure - actually as I wrote - immediately gone checking specs. try: http://www.vitesse.com/products/categories.cfm?family_id=5&category_id=16 ... [ Okay I got to docs server. ] You are right - It has limit of 4K insns == 16k of executable memory. Sorry for confusion :( We really can address a lot of memory - we have 32MB for routing info and configuration - but for execution only 16kB of memory is available... From ak@muc.de Mon Aug 4 13:35:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 13:35:54 -0700 (PDT) Received: from colin2.muc.de (qmailr@colin2.muc.de [193.149.48.15]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74KZgFl031967 for ; Mon, 4 Aug 2003 13:35:43 -0700 Received: (qmail 21559 invoked by uid 3709); 4 Aug 2003 20:35:24 -0000 Date: 4 Aug 2003 22:35:24 +0200 Date: Mon, 4 Aug 2003 22:35:24 +0200 From: Andi Kleen To: "David S. Miller" Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-ID: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030804114507.6e496c77.davem@redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 4529 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@colin2.muc.de Precedence: bulk X-list: netdev Ok, here is a new patch again current BKCVS. It also moves the inet_peer_idlock only inside netsyms. -Andi diff -u linux-xfrm/include/net/dst.h-XFRM linux-xfrm/include/net/dst.h --- linux-xfrm/include/net/dst.h-XFRM 2003-06-29 12:29:21.000000000 +0200 +++ linux-xfrm/include/net/dst.h 2003-08-04 22:16:49.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-xfrm/include/net/xfrm.h-XFRM linux-xfrm/include/net/xfrm.h --- linux-xfrm/include/net/xfrm.h-XFRM 2003-08-04 22:09:46.000000000 +0200 +++ linux-xfrm/include/net/xfrm.h 2003-08-04 22:16:49.000000000 +0200 @@ -588,6 +588,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -653,6 +655,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -783,12 +805,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned short family); extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -810,7 +852,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-xfrm/net/core/skbuff.c-XFRM linux-xfrm/net/core/skbuff.c --- linux-xfrm/net/core/skbuff.c-XFRM 2003-06-19 09:21:04.000000000 +0200 +++ linux-xfrm/net/core/skbuff.c 2003-08-04 22:16:49.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-xfrm/net/ipv4/Kconfig-XFRM linux-xfrm/net/ipv4/Kconfig --- linux-xfrm/net/ipv4/Kconfig-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-xfrm/net/ipv4/Makefile-XFRM linux-xfrm/net/ipv4/Makefile --- linux-xfrm/net/ipv4/Makefile-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-xfrm/net/ipv4/route.c-XFRM linux-xfrm/net/ipv4/route.c --- linux-xfrm/net/ipv4/route.c-XFRM 2003-06-19 09:21:04.000000000 +0200 +++ linux-xfrm/net/ipv4/route.c 2003-08-04 22:16:49.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-xfrm/net/ipv4/udp.c-XFRM linux-xfrm/net/ipv4/udp.c --- linux-xfrm/net/ipv4/udp.c-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/udp.c 2003-08-04 22:16:49.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-xfrm/net/ipv6/Kconfig-XFRM linux-xfrm/net/ipv6/Kconfig --- linux-xfrm/net/ipv6/Kconfig-XFRM 2003-08-04 22:09:48.000000000 +0200 +++ linux-xfrm/net/ipv6/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -22,6 +22,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +35,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +49,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-xfrm/net/ipv6/Makefile-XFRM linux-xfrm/net/ipv6/Makefile --- linux-xfrm/net/ipv6/Makefile-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/ipv6/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-xfrm/net/ipv6/ipv6_syms.c-XFRM linux-xfrm/net/ipv6/ipv6_syms.c --- linux-xfrm/net/ipv6/ipv6_syms.c-XFRM 2003-06-16 09:04:50.000000000 +0200 +++ linux-xfrm/net/ipv6/ipv6_syms.c 2003-08-04 22:16:49.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-xfrm/net/ipv6/route.c-XFRM linux-xfrm/net/ipv6/route.c --- linux-xfrm/net/ipv6/route.c-XFRM 2003-08-04 22:09:48.000000000 +0200 +++ linux-xfrm/net/ipv6/route.c 2003-08-04 22:16:49.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-xfrm/net/xfrm/Kconfig-XFRM linux-xfrm/net/xfrm/Kconfig --- linux-xfrm/net/xfrm/Kconfig-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/xfrm/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. diff -u linux-xfrm/net/xfrm/Makefile-XFRM linux-xfrm/net/xfrm/Makefile --- linux-xfrm/net/xfrm/Makefile-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/xfrm/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. # -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-xfrm/net/Kconfig-XFRM linux-xfrm/net/Kconfig --- linux-xfrm/net/Kconfig-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -83,6 +83,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. They are required if you are going to use IPsec tools ported diff -u linux-xfrm/net/netsyms.c-XFRM linux-xfrm/net/netsyms.c --- linux-xfrm/net/netsyms.c-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/netsyms.c 2003-08-04 22:19:14.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -276,6 +275,7 @@ EXPORT_SYMBOL(inetdev_by_index); EXPORT_SYMBOL(in_dev_finish_destroy); EXPORT_SYMBOL(ip_defrag); +EXPORT_SYMBOL(inet_peer_idlock); /* Route manipulation */ EXPORT_SYMBOL(ip_rt_ioctl); @@ -293,80 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); - -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(secpath_dup); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); -EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); -EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From shemminger@osdl.org Mon Aug 4 16:43:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:43:34 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74NhPFl004508 for ; Mon, 4 Aug 2003 16:43:26 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h74NLlI08855; Mon, 4 Aug 2003 16:21:47 -0700 Date: Mon, 4 Aug 2003 16:21:47 -0700 From: Stephen Hemminger To: Jeff Garzik Cc: netdev@oss.sgi.com Subject: [PATCH] convert lp486e driver to dynamic allocation Message-Id: <20030804162147.591c55f6.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4530 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Convert this driver from static net_device to using alloc_etherdev. Patch against 2.6.0-test2. Loads and unloads, but don't have the actual hardware. diff -Nru a/drivers/net/lp486e.c b/drivers/net/lp486e.c --- a/drivers/net/lp486e.c Mon Aug 4 14:53:55 2003 +++ b/drivers/net/lp486e.c Mon Aug 4 14:53:55 2003 @@ -975,15 +975,7 @@ return -EBUSY; } - /* - * Allocate working memory, 16-byte aligned - */ - dev->mem_start = (unsigned long) kmalloc(sizeof(struct i596_private) + 0x0f, GFP_KERNEL); - if (!dev->mem_start) - goto err_out; - dev->priv = (void *)((dev->mem_start + 0xf) & 0xfffffff0); lp = (struct i596_private *) dev->priv; - memset((void *)lp, 0, sizeof(struct i596_private)); spin_lock_init(&lp->cmd_lock); /* @@ -997,7 +989,6 @@ dev->base_addr = IOADDR; dev->irq = IRQ; - ether_setup(dev); /* * How do we find the ethernet address? I don't know. @@ -1045,8 +1036,6 @@ return 0; err_out_kfree: - kfree ((void *) dev->mem_start); -err_out: release_region(IOADDR, LP486E_TOTAL_SIZE); return ret; } @@ -1318,29 +1307,36 @@ MODULE_PARM(options, "1-" __MODULE_STRING(MAX_UNITS) "i"); MODULE_PARM(full_duplex, "1-" __MODULE_STRING(MAX_UNITS) "i"); -static struct net_device dev_lp486e; +static struct net_device *dev_lp486e; static int full_duplex; static int options; static int io = IOADDR; static int irq = IRQ; static int __init lp486e_init_module(void) { - struct net_device *dev = &dev_lp486e; + struct net_device *dev; + + dev = alloc_etherdev(sizeof(struct i596_private)); + if (!dev) + return -ENOMEM; + dev->irq = irq; dev->base_addr = io; dev->init = lp486e_probe; - if (register_netdev(dev) != 0) + if (register_netdev(dev) != 0) { + kfree(dev); return -EIO; + } + dev_lp486e = dev; full_duplex = 0; options = 0; return 0; } static void __exit lp486e_cleanup_module(void) { - unregister_netdev(&dev_lp486e); - kfree((void *)dev_lp486e.mem_start); - dev_lp486e.priv = NULL; - release_region(dev_lp486e.base_addr, LP486E_TOTAL_SIZE); + unregister_netdev(dev_lp486e); + release_region(dev_lp486e->base_addr, LP486E_TOTAL_SIZE); + kfree(dev_lp486e); } module_init(lp486e_init_module); From davem@redhat.com Mon Aug 4 16:53:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:53:55 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74NroFl004999 for ; Mon, 4 Aug 2003 16:53:51 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21337; Mon, 4 Aug 2003 16:49:20 -0700 Date: Mon, 4 Aug 2003 16:49:20 -0700 From: "David S. Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804164920.371d5afd.davem@redhat.com> In-Reply-To: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> <20030804203524.GA15874@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4531 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 22:35:24 +0200 Andi Kleen wrote: > Ok, here is a new patch again current BKCVS. It also moves the > inet_peer_idlock only inside netsyms. Appied, thanks Andi. From davem@redhat.com Mon Aug 4 16:56:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:56:12 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74Nu5Fl005356 for ; Mon, 4 Aug 2003 16:56:06 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21354; Mon, 4 Aug 2003 16:51:37 -0700 Date: Mon, 4 Aug 2003 16:51:37 -0700 From: "David S. Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804165137.40d744c5.davem@redhat.com> In-Reply-To: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> <20030804203524.GA15874@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4532 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 22:35:24 +0200 Andi Kleen wrote: > Ok, here is a new patch again current BKCVS. It also moves the inet_peer_idlock > only inside netsyms. This one is missing net/xfrm/xfrm_export.c :-( Andi, please be more careful with your patches. I'd suggest use subversions or whatever source management system you like best to help avoid these problems in the future. You seem to be chronicly making mistakes like this, as if you're rushing things. From davem@redhat.com Mon Aug 4 17:02:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 17:02:51 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7502iFl005850 for ; Mon, 4 Aug 2003 17:02:44 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21376; Mon, 4 Aug 2003 16:57:46 -0700 Date: Mon, 4 Aug 2003 16:57:46 -0700 From: "David S. Miller" To: Krishna Kumar Cc: kuznet@ms2.inr.ac.ru, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com, krkumar@us.ibm.com Subject: Re: O/M flags against 2.6.0-test1 Message-Id: <20030804165746.133f370a.davem@redhat.com> In-Reply-To: References: <20030730220223.4c25fcfe.davem@redhat.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4533 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 31 Jul 2003 13:33:27 -0700 (PDT) Krishna Kumar wrote: > > Ok, but then please use "__s32". > > OK, slowly getting there :-) > > Latest patch follows : Krishna is away, but let us make more progress on this patch. I see some problem with it that still need to be resolved: > +/* Subtype attributes for IFLA_PROTINFO */ > +enum > +{ > + IFLA_INET6_UNSPEC, > + IFLA_INET6_FLAGS, /* link flags */ > + IFLA_INET6_CONF, /* sysctl parameters */ > + IFLA_INET6_STATS, /* statistics */ > + IFLA_INET6_MCAST, /* MC things. What of them? */ > +}; > + > +#define IFLA_INET6_MAX IFLA_INET6_MCAST Ok, how does this actually work? The code does RTA_PUT(...IFLA_INET6_*...) but IFLA_PROTINFO is not actually used anywhere. This cannot work, it makes these RTA attributes just look like whatever IFLA_* ones have the same values as the inet6 ones in this enumeration. Alexey, how did you intend this stuff to be done? Cerainly not like this :-) > + /* return the device sysctl params */ > + if ((array = kmalloc(DEVCONF_MAX * sizeof(*array), GFP_KERNEL)) == NULL) > + goto rtattr_failure; > + ipv6_store_devconf(&idev->cnf, array); > + RTA_PUT(skb, IFLA_INET6_CONF, DEVCONF_MAX * sizeof(*array), array); This is what I'm talking about. Maybe there is something I'm missing. How does APP know to interpret IFLA_INET6_CONF as "sub-attribute" of IFLA_PROTINFO? Also, missing "memset(array, 0, sizeof(*array));" else we leak uninitialized kernel memory into user space. Another bug, GFP_KERNEL memory allocation with dev_base_lock held. Otherwise I am OK with the patch. From davem@redhat.com Mon Aug 4 17:03:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 17:03:33 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7503TFl006007 for ; Mon, 4 Aug 2003 17:03:29 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21402; Mon, 4 Aug 2003 16:59:04 -0700 Date: Mon, 4 Aug 2003 16:59:04 -0700 From: "David S. Miller" To: Stephen Hemminger Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: [PATCH] convert lp486e driver to dynamic allocation Message-Id: <20030804165904.0e9f60ab.davem@redhat.com> In-Reply-To: <20030804162147.591c55f6.shemminger@osdl.org> References: <20030804162147.591c55f6.shemminger@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4534 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 16:21:47 -0700 Stephen Hemminger wrote: > Convert this driver from static net_device to using alloc_etherdev. Applied, thanks. From scott.feldman@intel.com Mon Aug 4 20:45:15 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 20:45:24 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h753jFFl017502 for ; Mon, 4 Aug 2003 20:45:15 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h753h3x04231 for ; Tue, 5 Aug 2003 03:43:03 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h753eBv26263 for ; Tue, 5 Aug 2003 03:40:11 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080420571626008 ; Mon, 04 Aug 2003 20:57:16 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 4 Aug 2003 20:45:09 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Mon, 4 Aug 2003 20:45:08 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNZhWYRC0Gz1n9oToGU+hvgKaMpJwBWPZNQ From: "Feldman, Scott" To: "Jeff Garzik" Cc: X-OriginalArrivalTime: 05 Aug 2003 03:45:09.0070 (UTC) FILETIME=[F76DBEE0:01C35B03] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h753jFFl017502 X-archive-position: 4535 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev New one: http://sf.net/projects/e1000, e100-3.0.0_dev12.tar.gz > Comments: Thanks Jeff! > * (API) Does the out-of-tx-resources condition in > e100_xmit_frame ever really happen? I am under the > impression that returning non-zero in ->hard_start_xmit > results in the packet sometimes being requeued and > sometimes dropped. I prefer to guarantee a more-steady > state, by simply dropping the packet unconditionally, > when this uncommon condition occurs. So, I would > a) mark the failure condition with unlikely(), and > b) if the condition occurs, simply drop the packet > (tx_dropped++, kfree > skb), and return zero. Stop the queue also? if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { netif_stop_queue(netdev); nic->net_stats.tx_dropped++; dev_kfree_skb(skb); return 0; } Added some more likely/unlikely's in the perf paths. > * (minor) for completeness, you should limit the PCI class in the > pci_device_id table to PCI_CLASS_NETWORK_ETHERNET. There are > one-in-a-million cases where this matters, but it's usually a > BIOS bug. Still, it's there in pci_device_id table, and it's an easy > change, so might as well use it. OK > * (style) your struct config definition is terribly clever. > perhaps too clever, making it unreadable? Not a specific complaint, > mind you, just something that caught my eye. Then the driver would be perfect. We can't have that. ;-) > * (minor) in tg3, my own benchmarks and experiments showed it > helped to explictly use ____cacheline_aligned markers when > defining certain sections of members in struct tg3 > (or struct nic, in e100's case). You already clearly pay > attention to member layout WRT cache effects, but if > you have a clear dividing line, or lines, in struct nic you can use > _____cacheline_aligned for even greater benefit. At a > minimum test it with a cpu-usage-measuring benchmark like ttcp, > though, of course :) OK > * (extremely minor) some people (like me :)) consider dead reads like > the readb() call in e100_write_flush OK > * (major?) Aren't there some clunky e100 adapters that don't do MMIO? > Do we care? Not that I'm aware of. Current e100 doesn't support them if they're out there. > * I would love to see feedback from people testing this > driver on ppc64 and sparc64, particularly. Me too. Things seem to work on ppc (Mac) and ia64. > * (style, minor) My eyes would prefer functions like e100_hw_reset to > have a few more blank lines in them, spreading code+comment > blocks out a bit. OK > * (extremely minor) one wonders if you really need the write flush in > mdio_ctrl. If the flush is removed, the same net effect > appears to occur. Good catch. > * (boring but needed) convert all the magic numbers in e100_configure > into constants, or at least add comments describing the magic > numbers. If the value is just one bit, you might simply append "/* > true */", for example. The general idea is to make the "member name = > value" list a little bit more readable to somebody who doesn't know the > hardware, and struct config, intimately. That _was_ boring. > * IIRC Donald's MII phy scanning code scans MII phy ids like this: > 1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In > general I would prefer to follow his eepro100.c probe order. > Some phys need this because they will report on both phy id #0 (which > is magical) and phy id #(non-zero). Donald would know more than me, here. [kernel] eepro100 gets the ID from the eeprom, so no scanning there. Current e100 goes 1, 0..31, which is what we've always done, IIRC. > * Is it easy to support MII phy interrupts? It would be nice > to get a callback that was handled immediately, on phys that > do support such interrupts. I don't see those being passed through and handled by the MAC. > * do we care about spinlocks around the update_stats and > get_stats code? Not sure. update_stats runs in a timer callback. Can get_stats jump in? > * (bugs) in e100_up, you should undo mod_timer [major] and > netif_start_queue [minor], if request_irq fails. And maybe stop the > receiver, too? OK > * for all constants 0xffffffff (and others as well if you so choose), > prefer the C99 suffix to a cast. This is particularly relevant in > pci_set_dma_mask calls, where one should be using 0xffffffffULL, but > applies to other constants as well. I didn't see any other constant casts besides the pci_set_dma_mask call. That one is fixed. > * (potential races) in e100_probe, you want to call > register_netdev as basically the last operation that can > fail, if possible. Particularly, you need to move the > PCI API operations above register_netdev. > Remember, register_netdev winds up calling /sbin/hotplug, > which in turn calls programs that will want to start using > the interface. So you need to have everything set up by > that point, really. OK (nice catch). > * in e100_probe, "if(nic->csr == 0UL) {" should really just test for > NULL, because ioremap is defined to return a pointer... OK > * (minor) use a netif_msg_xxx wrapper/constant in > e100_init_module test? Can't - don't have nic->msg_enable allocated yet. :( -scott From jgarzik@pobox.com Mon Aug 4 22:29:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 22:30:40 -0700 (PDT) Received: from www.linux.org.uk (IDENT:h2Rxu3GU7PMeJPrMvDTL9MOVi3QhHI88@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h755TsFl022666 for ; Mon, 4 Aug 2003 22:29:55 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19juOO-00060o-CI; Tue, 05 Aug 2003 06:29:52 +0100 Message-ID: <3F2F40C5.9070601@pobox.com> Date: Tue, 05 Aug 2003 01:29:41 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4536 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Feldman, Scott wrote: >>* (API) Does the out-of-tx-resources condition in >>e100_xmit_frame ever really happen? I am under the >>impression that returning non-zero in ->hard_start_xmit >>results in the packet sometimes being requeued and >>sometimes dropped. I prefer to guarantee a more-steady >>state, by simply dropping the packet unconditionally, >>when this uncommon condition occurs. So, I would >>a) mark the failure condition with unlikely(), and >>b) if the condition occurs, simply drop the packet >>(tx_dropped++, kfree >>skb), and return zero. > > > Stop the queue also? > > if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { > netif_stop_queue(netdev); > nic->net_stats.tx_dropped++; > dev_kfree_skb(skb); > return 0; > } Yes. I would also printk(KERN_ERR "we have a bug!") or somesuch, like several other drivers do, too. >>* IIRC Donald's MII phy scanning code scans MII phy ids like this: >>1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In >>general I would prefer to follow his eepro100.c probe order. >>Some phys need this because they will report on both phy id #0 (which >>is magical) and phy id #(non-zero). Donald would know more than me, > > here. > > [kernel] eepro100 gets the ID from the eeprom, so no scanning there. > Current e100 goes 1, 0..31, which is what we've always done, IIRC. hmmm. I prefer the phy scanning to checking eeprom, since it reduces the chance of eeprom screwups. However, I still think there's some issue related to phy id #0. Oh well, fine for now, I guess. >>* do we care about spinlocks around the update_stats and >>get_stats code? > > > Not sure. update_stats runs in a timer callback. Can get_stats jump > in? Well, the ->get_stats only returns a pointer to the stats, which are then accessed in an unlocked manner. Since the net stats are unsigned longs, asynchronously reading and updating them isn't a big deal in practice. >>* (minor) use a netif_msg_xxx wrapper/constant in >>e100_init_module test? > > > Can't - don't have nic->msg_enable allocated yet. :( You could always use "(1 << debug) - 1"... :) I dunno if it's worth worrying about. Jeff From davem@redhat.com Tue Aug 5 00:21:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 00:22:20 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h757L0Fl030028 for ; Tue, 5 Aug 2003 00:21:41 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA22135; Tue, 5 Aug 2003 00:16:31 -0700 Date: Tue, 5 Aug 2003 00:16:31 -0700 From: "David S. Miller" To: "Feldman, Scott" Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030805001631.2fb55f38.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4537 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 20:45:08 -0700 "Feldman, Scott" wrote: > > * I would love to see feedback from people testing this > > driver on ppc64 and sparc64, particularly. > > Me too. Things seem to work on ppc (Mac) and ia64. This gets things building on sparc64, I'll stick an e100 into my workstation and use it for everything for a while using this driver. --- Makefile.~1~ 2003-08-04 20:20:42.000000000 -0700 +++ Makefile 2003-08-05 00:12:29.000000000 -0700 @@ -96,10 +96,15 @@ endif # pick a compiler -ifneq (,$(findstring egcs-2.91.66, $(shell cat /proc/version))) - CC := kgcc gcc cc +ARCH := $(shell uname -m | sed 's/i.86/i386/') +ifeq ($(ARCH),sparc64) +CC := $(shell if gcc -m64 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo gcc; else echo sparc64-linux-gcc; fi ) else - CC := gcc cc + ifneq (,$(findstring egcs-2.91.66, $(shell cat /proc/version))) + CC := kgcc gcc cc + else + CC := gcc cc + endif endif test_cc = $(shell which $(cc) > /dev/null 2>&1 && echo $(cc)) CC := $(foreach cc, $(CC), $(test_cc)) @@ -198,10 +203,30 @@ # we need to know what platform the driver is being built on # some additional features are only built on Intel platforms -ARCH := $(shell uname -m | sed 's/i.86/i386/') ifeq ($(ARCH),alpha) CFLAGS += -ffixed-8 -mno-fp-regs endif +ifeq ($(ARCH),sparc64) + NEW_GCC := $(shell if $(CC) -m64 -mcmodel=medlow -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo y; else echo n; fi; ) + UNDECLARED_REGS := $(shell if $(CC) -c -x assembler /dev/null -Wa,--help | grep undeclared-regs > /dev/null; then echo y; else echo n; fi; ) + INLINE_LIMIT := $(shell if $(CC) -m64 -finline-limit=100000 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo y; else echo n; fi; ) + ifneq ($(UNDECLARED_REGS),y) + CC_UNDECL = + else + CC_UNDECL = -Wa,--undeclared-regs + endif + ifneq ($(NEW_GCC),y) + CFLAGS += -pipe -mno-fpu -mtune=ultrasparc -mmedlow \ + -ffixed-g4 -fcall-used-g5 -fcall-used-g7 -Wno-sign-compare + else + CFLAGS += -m64 -pipe -mno-fpu -mcpu=ultrasparc -mcmodel=medlow \ + -ffixed-g4 -fcall-used-g5 -fcall-used-g7 -Wno-sign-compare \ + $(CC_UNDECL) + endif + ifeq ($(INLINE_LIMIT),y) + CFLAGS := $(CFLAGS) -finline-limit=100000 + endif +endif # depmod version for rpm builds DEPVER := $(shell /sbin/depmod -V 2>/dev/null | awk 'BEGIN {FS="."} NR==1 {print $$2}') --- e100.c.~1~ 2003-08-04 20:20:42.000000000 -0700 +++ e100.c 2003-08-05 00:13:23.000000000 -0700 @@ -150,6 +150,7 @@ #include #include #include +#include #include "kcompat.h" From felix@allot.com Tue Aug 5 01:23:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 01:23:14 -0700 (PDT) Received: from mxout3.netvision.net.il (mxout3.netvision.net.il [194.90.9.24]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h758N4Fl001990 for ; Tue, 5 Aug 2003 01:23:05 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout3.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ50039I0M9WZ@mxout3.netvision.net.il> for netdev@oss.sgi.com; Tue, 05 Aug 2003 11:22:57 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QG1CBB3A; Tue, 05 Aug 2003 11:25:57 +0200 Date: Tue, 05 Aug 2003 11:23:22 +0300 From: Felix Radensky Subject: Re: e100 "Ferguson" release To: Ben Greear Cc: Jeff Garzik , "Feldman, Scott" , netdev@oss.sgi.com Message-id: <3F2F697A.2020708@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> X-archive-position: 4538 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev I've also noticed that the number of hard_start_xmit failures in e1000 has increased significantly in version 5.1.13-k1. In version 5.0.43-k1 the number of failures was much smaller. Felix. Ben Greear wrote: > > > > With e100 and e1000, I see the very large numbers of the > hard_start_xmit failure > when running very high packets-per-second rates (small packets). > I see virtually no failures with tulip. pktgen knows how to re-queue, > but it's > curious it has to so often. For code that does not requeue, this > could be even > more of a bummer. > > > From kuznet@ms2.inr.ac.ru Tue Aug 5 06:41:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 06:41:07 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75Df0Fl005543 for ; Tue, 5 Aug 2003 06:41:01 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id RAA28267; Tue, 5 Aug 2003 17:40:42 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308051340.RAA28267@dub.inr.ac.ru> Subject: [PATCH] repairing rtcache killer To: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Date: Tue, 5 Aug 2003 17:40:42 +0400 (MSD) X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4539 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! Alexey # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1613 -> 1.1614 # net/ipv4/route.c 1.66 -> 1.67 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/05 kuznet@mops.inr.ac.ru 1.1614 # route.c: # [IPV4] Repair calculation of rtcache entries score # # Two serious and interesting mistakes were made in the patch of 2003-06-16. # 1. Variance of hash chain turned out to be unexpectedly high, so truncation # chain length at <=ip_rt_gc_elasticity results in strong growth of # cache misses. Set the threshould to 2*ip_rt_gc_elasticity. # And continue to think how to switch to mode when lots of cache # entries are used once or twice, so truncation should be done at 1. # 2. The selection rt_score() function based on use count resulted in killing # new fresh entries. Actually, it is clear when minimal brain efforts # are applied. :-) So, switch to scoring using last used time, which # should give real LRU behaviour. # -------------------------------------------- # diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c --- a/net/ipv4/route.c Tue Aug 5 17:37:41 2003 +++ b/net/ipv4/route.c Tue Aug 5 17:37:41 2003 @@ -463,7 +463,9 @@ */ static inline u32 rt_score(struct rtable *rt) { - u32 score = rt->u.dst.__use; + u32 score = jiffies - rt->u.dst.lastuse; + + score = ~score & ~(3<<30); if (rt_valuable(rt)) score |= (1<<31); @@ -807,8 +809,7 @@ * The second limit is less certain. At the moment it allows * only 2 entries per bucket. We will see. */ - if (chain_length > ip_rt_gc_elasticity || - (chain_length > 1 && !(min_score & (1<<31)))) { + if (chain_length > 2*ip_rt_gc_elasticity) { *candp = cand->u.rt_next; rt_free(cand); } From vnuorval@tcs.hut.fi Tue Aug 5 07:20:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 07:20:17 -0700 (PDT) Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75EJwFl007784 for ; Tue, 5 Aug 2003 07:19:59 -0700 Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id E93028001CD; Tue, 5 Aug 2003 16:42:32 +0300 (EEST) Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h75DgW5L031191; Tue, 5 Aug 2003 16:42:32 +0300 Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h75DgWhQ031187; Tue, 5 Aug 2003 16:42:32 +0300 Date: Tue, 5 Aug 2003 16:42:32 +0300 (EEST) From: Ville Nuorvala To: davem@redhat.com Cc: netdev@oss.sgi.com Subject: [PATCH] IPV6: Fix bugs in ip6ip6_tnl_xmit() In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-377318441-99616309-1060088089=:30970" Content-ID: X-archive-position: 4540 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vnuorval@tcs.hut.fi Precedence: bulk X-list: netdev This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. ---377318441-99616309-1060088089=:30970 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII Content-ID: Hi, There were two bugs in ip6ip6_tnl_xmit() which are fixed in this patch (made against Linux 2.6.0-test2 + cset 1.1612): - ip6_tunnel must give its own getfrag function to ip6_append_data() - fix dst leakage when encapsulated packet too big Please apply! Thanks, Ville -- Ville Nuorvala Research Assistant, Institute of Digital Communications, Helsinki University of Technology email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257 ---377318441-99616309-1060088089=:30970 Content-Type: TEXT/PLAIN; charset=US-ASCII; name="ip6_tnl_xmit.patch" Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: Content-Disposition: attachment; filename="ip6_tnl_xmit.patch" ZGlmZiAtTnVyIC0tZXhjbHVkZT1SQ1MgLS1leGNsdWRlPUNWUyAtLWV4Y2x1 ZGU9U0NDUyAtLWV4Y2x1ZGU9Qml0S2VlcGVyIC0tZXhjbHVkZT1DaGFuZ2VT ZXQgbGludXgtMi41Lk9MRC9uZXQvaXB2Ni9pcDZfdHVubmVsLmMgbGludXgt Mi41L25ldC9pcHY2L2lwNl90dW5uZWwuYw0KLS0tIGxpbnV4LTIuNS5PTEQv bmV0L2lwdjYvaXA2X3R1bm5lbC5jCVR1ZSBBdWcgIDUgMTU6MTU6MDcgMjAw Mw0KKysrIGxpbnV4LTIuNS9uZXQvaXB2Ni9pcDZfdHVubmVsLmMJVHVlIEF1 ZyAgNSAxNTo0NTo0MSAyMDAzDQpAQCAtNjIxLDYgKzYyMSwxNCBAQA0KIAly ZXR1cm4gb3B0Ow0KIH0NCiANCitzdGF0aWMgaW50IA0KK2lwNmlwNl9nZXRm cmFnKHZvaWQgKmZyb20sIGNoYXIgKnRvLCBpbnQgb2Zmc2V0LCBpbnQgbGVu LCBpbnQgb2RkLCANCisJCXN0cnVjdCBza19idWZmICpza2IpDQorew0KKwlt ZW1jcHkodG8sIChjaGFyICopIGZyb20gKyBvZmZzZXQsIGxlbik7DQorCXJl dHVybiAwOw0KK30NCisNCiAvKioNCiAgKiBpcDZpcDZfdG5sX2FkZHJfY29u ZmxpY3QgLSBjb21wYXJlIHBhY2tldCBhZGRyZXNzZXMgdG8gdHVubmVsJ3Mg b3duDQogICogICBAdDogdGhlIG91dGdvaW5nIHR1bm5lbCBkZXZpY2UNCkBA IC03NTUsOSArNzYzLDkgQEANCiAJfQ0KIAlpZiAoc2tiLT5sZW4gPiBtdHUp IHsNCiAJCWljbXB2Nl9zZW5kKHNrYiwgSUNNUFY2X1BLVF9UT09CSUcsIDAs IG10dSwgZGV2KTsNCi0JCWdvdG8gdHhfZXJyX29wdF9yZWxlYXNlOw0KKwkJ Z290byB0eF9lcnJfZHN0X3JlbGVhc2U7DQogCX0NCi0JZXJyID0gaXA2X2Fw cGVuZF9kYXRhKHNrLCBpcF9nZW5lcmljX2dldGZyYWcsIHNrYi0+bmgucmF3 LCBza2ItPmxlbiwgMCwNCisJZXJyID0gaXA2X2FwcGVuZF9kYXRhKHNrLCBp cDZpcDZfZ2V0ZnJhZywgc2tiLT5uaC5yYXcsIHNrYi0+bGVuLCAwLA0KIAkJ CSAgICAgIHQtPnBhcm1zLmhvcF9saW1pdCwgb3B0LCAmZmwsIA0KIAkJCSAg ICAgIChzdHJ1Y3QgcnQ2X2luZm8gKilkc3QsIE1TR19ET05UV0FJVCk7DQog DQpAQCAtNzg1LDcgKzc5Myw2IEBADQogCXJldHVybiAwOw0KIHR4X2Vycl9k c3RfcmVsZWFzZToNCiAJZHN0X3JlbGVhc2UoZHN0KTsNCi10eF9lcnJfb3B0 X3JlbGVhc2U6DQogCWlmIChvcHQgJiYgb3B0ICE9IG9yaWdfb3B0KQ0KIAkJ c29ja19rZnJlZV9zKHNrLCBvcHQsIG9wdC0+dG90X2xlbik7DQogdHhfZXJy X2ZyZWVfZmxfbGJsOg0KZGlmZiAtTnVyIC0tZXhjbHVkZT1SQ1MgLS1leGNs dWRlPUNWUyAtLWV4Y2x1ZGU9U0NDUyAtLWV4Y2x1ZGU9Qml0S2VlcGVyIC0t ZXhjbHVkZT1DaGFuZ2VTZXQgbGludXgtMi41Lk9MRC9uZXQvbmV0c3ltcy5j IGxpbnV4LTIuNS9uZXQvbmV0c3ltcy5jDQotLS0gbGludXgtMi41Lk9MRC9u ZXQvbmV0c3ltcy5jCVR1ZSBBdWcgIDUgMTU6MTU6MDMgMjAwMw0KKysrIGxp bnV4LTIuNS9uZXQvbmV0c3ltcy5jCVR1ZSBBdWcgIDUgMTM6NTg6MzQgMjAw Mw0KQEAgLTQ4MiwxMCArNDgyLDggQEANCiBFWFBPUlRfU1lNQk9MKHN5c2N0 bF9tYXhfc3luX2JhY2tsb2cpOw0KICNlbmRpZg0KIA0KLSNlbmRpZg0KLQ0K LSNpZiBkZWZpbmVkIChDT05GSUdfSVBWNl9NT0RVTEUpIHx8IGRlZmluZWQg KENPTkZJR19JUF9TQ1RQX01PRFVMRSkgfHwgZGVmaW5lZCAoQ09ORklHX0lQ VjZfVFVOTkVMX01PRFVMRSkNCiBFWFBPUlRfU1lNQk9MKGlwX2dlbmVyaWNf Z2V0ZnJhZyk7DQorDQogI2VuZGlmDQogDQogRVhQT1JUX1NZTUJPTCh0Y3Bf cmVhZF9zb2NrKTsNCg== ---377318441-99616309-1060088089=:30970-- From scott.feldman@intel.com Tue Aug 5 07:29:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 07:29:10 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75ET6Fl008508 for ; Tue, 5 Aug 2003 07:29:06 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h75EQrw18915 for ; Tue, 5 Aug 2003 14:26:54 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h75EO1c21378 for ; Tue, 5 Aug 2003 14:24:01 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080507285928763 ; Tue, 05 Aug 2003 07:28:59 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Tue, 5 Aug 2003 07:28:59 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Tue, 5 Aug 2003 07:28:58 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNbEppm4ua1VvpURRC1DmNP6YxZrAASpVCA From: "Feldman, Scott" To: "Jeff Garzik" Cc: X-OriginalArrivalTime: 05 Aug 2003 14:28:59.0646 (UTC) FILETIME=[E90BD9E0:01C35B5D] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h75ET6Fl008508 X-archive-position: 4541 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev > > if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { > > netif_stop_queue(netdev); > > nic->net_stats.tx_dropped++; > > dev_kfree_skb(skb); > > return 0; > > } > > Yes. I would also printk(KERN_ERR "we have a bug!") or > somesuch, like several other drivers do, too. It's there, sorry, was trying to keep the code snippet small. > >>* (minor) use a netif_msg_xxx wrapper/constant in > >>e100_init_module test? > > > > > > Can't - don't have nic->msg_enable allocated yet. :( > > You could always use "(1 << debug) - 1"... :) I dunno if it's worth > worrying about. (1 << debug) - 1) & NETIF_MSG_DRV is what's there now. -scott From david-b@pacbell.net Tue Aug 5 08:14:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:15:03 -0700 (PDT) Received: from mta7.pltn13.pbi.net (mta7.pltn13.pbi.net [64.164.98.8]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FEsFl012006 for ; Tue, 5 Aug 2003 08:14:55 -0700 Received: from pacbell.net (ppp-67-118-247-188.dialup.pltn13.pacbell.net [67.118.247.188]) by mta7.pltn13.pbi.net (8.12.9/8.12.3) with ESMTP id h75FEgeC006162; Tue, 5 Aug 2003 08:14:43 -0700 (PDT) Message-ID: <3F2E9A09.7000707@pacbell.net> Date: Mon, 04 Aug 2003 10:38:17 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> <3F2DDC3A.2040707@pacbell.net> <20030803211333.12839f66.davem@redhat.com> In-Reply-To: <20030803211333.12839f66.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4542 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: > > For example, what do USB block device drivers do when -ENOMEM comes > back? Do they just drop the request on the floor? No, rather they > resubmit the request later without the scsi/block layer knowing > anything about what happened, right? I didn't notice any code to retry, but I did see some that morphed ENOMEM into a generic scsi "error". Scsi presumably does something more or less intelligent then. The network layer on the other hand _does_ have hooks for retrying, not that they're used much. - Dave From scott.feldman@intel.com Tue Aug 5 08:19:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:19:39 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FJWFl013136 for ; Tue, 5 Aug 2003 08:19:32 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h75FHKw28917 for ; Tue, 5 Aug 2003 15:17:20 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h75FERc21323 for ; Tue, 5 Aug 2003 15:14:27 GMT Received: from orsmsx331.amr.corp.intel.com ([192.168.65.56]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080508192629015 ; Tue, 05 Aug 2003 08:19:26 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx331.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Tue, 5 Aug 2003 08:19:26 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Tue, 5 Aug 2003 08:19:25 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNbKsmh/6+8q5R8RsOBg+Su6l5c9gANLF4Q From: "Feldman, Scott" To: "Felix Radensky" , "Ben Greear" Cc: "Jeff Garzik" , X-OriginalArrivalTime: 05 Aug 2003 15:19:26.0092 (UTC) FILETIME=[F4F2DCC0:01C35B64] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h75FJWFl013136 X-archive-position: 4543 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev > I've also noticed that the number of hard_start_xmit failures > in e1000 has increased significantly in version 5.1.13-k1. In > version 5.0.43-k1 the number of failures was much smaller. Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and see what happens? With the change below, 5.1.13 would be more aggressive on Tx cleanup, so we'll be quicker waking the queue than before. -scott for(i = 0; i < E1000_MAX_INTR; i++) - if(!e1000_clean_rx_irq(adapter) && + if(!e1000_clean_rx_irq(adapter) & !e1000_clean_tx_irq(adapter)) break; [1] Something still bothers me about this new form where we're mixing a bit-wise operator with logical operands. Should this bother me? From garzik@gtf.org Tue Aug 5 08:24:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:24:30 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FOOFl013723 for ; Tue, 5 Aug 2003 08:24:25 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id EBC946663; Tue, 5 Aug 2003 11:24:18 -0400 (EDT) Date: Tue, 5 Aug 2003 11:24:18 -0400 From: Jeff Garzik To: "Feldman, Scott" Cc: Felix Radensky , Ben Greear , netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-ID: <20030805152418.GB6695@gtf.org> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i X-archive-position: 4544 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev On Tue, Aug 05, 2003 at 08:19:25AM -0700, Feldman, Scott wrote: > > I've also noticed that the number of hard_start_xmit failures > > in e1000 has increased significantly in version 5.1.13-k1. In > > version 5.0.43-k1 the number of failures was much smaller. > > Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and > see what happens? With the change below, 5.1.13 would be more > aggressive on Tx cleanup, so we'll be quicker waking the queue than > before. > > -scott > > for(i = 0; i < E1000_MAX_INTR; i++) > - if(!e1000_clean_rx_irq(adapter) && > + if(!e1000_clean_rx_irq(adapter) & > !e1000_clean_tx_irq(adapter)) > break; > > [1] Something still bothers me about this new form where we're mixing a > bit-wise operator with logical operands. Should this bother me? It doesn't matter to the compiler if you make it explicit: unsigned int rx_work = e1000_clean_rx_irq(); unsigned int tx_work = e1000_clean_tx_irq(); if (!rx_work && !tx_work) break; From Robert.Olsson@data.slu.se Tue Aug 5 10:08:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:08:42 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75H8UFl019597 for ; Tue, 5 Aug 2003 10:08:33 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id TAA27260; Tue, 5 Aug 2003 19:08:23 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16175.58503.134543.310459@robur.slu.se> Date: Tue, 5 Aug 2003 19:08:23 +0200 To: kuznet@ms2.inr.ac.ru Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Subject: [PATCH] repairing rtcache killer In-Reply-To: <200308051340.RAA28267@dub.inr.ac.ru> References: <200308051340.RAA28267@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4545 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev kuznet@ms2.inr.ac.ru writes: > # Two serious and interesting mistakes were made in the patch of 2003-06-16. > # 1. Variance of hash chain turned out to be unexpectedly high, so truncation > # chain length at <=ip_rt_gc_elasticity results in strong growth of > # cache misses. Set the threshould to 2*ip_rt_gc_elasticity. > # And continue to think how to switch to mode when lots of cache > # entries are used once or twice, so truncation should be done at 1. Hello! I'll guess the setting was very much affected by DoS attacs discussion which indicated very different flowlenths compared to our actual measurement for Uppsala University which had 65 pkts per new DST entry. Proably due to the "new" applications and lots of students. For autotuning I think we can have help from a ratio of warm cache hits (in_hit) and misses (in_slow_tot) to set threshhold to trim hash chain lengths. Cheers. --ro From ebiederm@xmission.com Tue Aug 5 10:22:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:22:46 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HMYFl020574 for ; Tue, 5 Aug 2003 10:22:34 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id LAA05791; Tue, 5 Aug 2003 11:19:09 -0600 To: Werner Almesberger Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> From: ebiederm@xmission.com (Eric W. Biederman) Date: 05 Aug 2003 11:19:09 -0600 In-Reply-To: <20030804162433.L5798@almesberger.net> Message-ID: Lines: 68 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4546 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Werner Almesberger writes: > Eric W. Biederman wrote: > > The optimized for low latency cases seem to have a strong > > market in clusters. > > Clusters have captive, no, _desperate_ customers ;-) And it > seems that people are just as happy putting MPI as their > transport on top of all those link-layer technologies. MPI is not a transport. It an interface like the Berkeley sockets layer. The semantics it wants right now are usually mapped to TCP/IP when used on an IP network. Though I suspect SCTP might be a better fit. But right now nothing in the IP stack is a particularly good fit. Right now there is a very strong feeling among most of the people using and developing on clusters that by and large what they are doing is not of interest to the general kernel community, and so has no chance of going in. So you see hack piled on top of hack piled on top of hack. Mostly I think the that is less true, at least if they can stand the process of severe code review and cleaning up their code. If we can put in code to scale the kernel to 64 processors. NIC drivers for fast interconnects and a few similar tweaks can't hurt either. But of course to get through the peer review process people need to understand what they are doing. > > There is one place in low latency communications that I can think > > of where TCP/IP is not the proper solution. For low latency > > communication the checksum is at the wrong end of the packet. > > That's one of the few things ATM's AAL5 got right. But in the end, > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, > you may well treat it as an atomic unit. So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the second switch chip + 1.3us to the top level switch chip + 1.3us to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver. 1.3us * 7 = 9.1us to deliver a packet to the other side. That is still quite painful. Right now I can get better latencies over any of the cluster interconnects. I think 5 us is the current low end, with the high end being about 1 us. Quite often in MPI when a message is sent the program cannot continue until the reply is received. Possibly this is a fundamental problem with the application programming model, encouraging applications to be latency sensitive. But it is a well established API and programming paradigm so it has to be lived with. All of this is pretty much the reverse of the TOE case. Things are latency sensitive because real work needs to be done. And the more latency you have the slower that work gets done. A lot of the NICs which are used for MPI tend to be smart for two reasons. 1) So they can do source routing. 2) So they can safely export some of their interface to user space, so in the fast path they can bypass the kernel. Eric From ebiederm@xmission.com Tue Aug 5 10:29:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:29:26 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HTLFl021304 for ; Tue, 5 Aug 2003 10:29:21 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id LAA05837; Tue, 5 Aug 2003 11:25:57 -0600 To: "David S. Miller" Cc: Werner Almesberger jgarzik@pobox.com, niv@us.ibm.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> <20030804122632.65ba2122.davem@redhat.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: 05 Aug 2003 11:25:57 -0600 In-Reply-To: <20030804122632.65ba2122.davem@redhat.com> Message-ID: Lines: 48 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4547 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev "David S. Miller" writes: > On Mon, 4 Aug 2003 16:24:33 -0300 > Werner Almesberger wrote: > > > Eric W. Biederman wrote: > > > There is one place in low latency communications that I can think > > > of where TCP/IP is not the proper solution. For low latency > > > communication the checksum is at the wrong end of the packet. > > > > That's one of the few things ATM's AAL5 got right. > > Let's recall how long the IFF_TRAILERS hack from BSD :-) Putting the variable length headers on the end of a packet? Or was that something other than RFC893? I think IPv6 solves that much more cleanly by simply deleting them. > > But in the end, I think it doesn't really matter. > > I tend to agree on this one. > > And on the transmit side if you have more than 1 pending TX frame, you > can always be prefetching the next one into the fifo so that by the > time the medium is ready all the checksum bits have been done. For large data transmissions that happens. > In fact I'd be surprised if current generation 1g/10g cards are not > doing something like this. Well at this point before I propose anything concrete I suspect I need to profile some actual application and see how things go. But from a very latency sensitive perspective, I would be surprised if the problem goes away with faster technology. For now I am happy just to insert the peculiar thought that latency across the entire cluster/lan is of great importance to some applications. Eric From ingo.oeser@informatik.tu-chemnitz.de Tue Aug 5 10:33:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:33:54 -0700 (PDT) Received: from meg.hrz.tu-chemnitz.de (meg.hrz.tu-chemnitz.de [134.109.132.57]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HXmFl021963 for ; Tue, 5 Aug 2003 10:33:50 -0700 Received: from tnt188.hrz.tu-chemnitz.de ([134.109.156.188] helo=nightmaster.csn.tu-chemnitz.de ident=root) by meg.hrz.tu-chemnitz.de with esmtp (Exim 4.12) id 19jhum-0003rB-00; Mon, 04 Aug 2003 18:10:30 +0200 Received: (from ioe@localhost) by nightmaster.csn.tu-chemnitz.de (8.9.1/8.9.1) id QAA23195; Mon, 4 Aug 2003 16:36:06 +0200 Date: Mon, 4 Aug 2003 16:36:06 +0200 From: Ingo Oeser To: Jeff Garzik Cc: Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030804163606.Q639@nightmaster.csn.tu-chemnitz.de> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <3F2C0C44.6020002@pobox.com>; from jgarzik@pobox.com on Sat, Aug 02, 2003 at 03:08:52PM -0400 X-Spam-Score: -5.0 (-----) X-Scanner: exiscan for exim4 (http://duncanthrax.net/exiscan/) *19jhum-0003rB-00*vFn3hP0u2Ks* X-archive-position: 4548 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ingo.oeser@informatik.tu-chemnitz.de Precedence: bulk X-list: netdev Hi Jeff, On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. Exactly what I suggested: sys_ioroute() "Providing generic pipelines and io routing as Linux service" Msg-ID: <20030718134235.K639@nightmaster.csn.tu-chemnitz.de> on linux-kernel and linux-fsdevel Be my guest. I know, that you mean doing it in hardware, but you cannot accelerate sth. which the kernel doesn't do ;-) Regards Ingo Oeser From miller@techsource.com Tue Aug 5 12:15:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 12:15:50 -0700 (PDT) Received: from kinesis.swishmail.com (qmailr@kinesis.swishmail.com [209.10.110.86]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75JFcFl028716 for ; Tue, 5 Aug 2003 12:15:39 -0700 Received: (qmail 42158 invoked by uid 89); 5 Aug 2003 19:15:37 -0000 Received: from unknown (HELO techsource.com) (209.208.48.130) by kinesis.swishmail.com with SMTP; 5 Aug 2003 19:15:37 -0000 Message-ID: <3F300549.60800@techsource.com> Date: Tue, 05 Aug 2003 15:28:09 -0400 From: Timothy Miller User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Larry McVoy CC: David Lang , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump References: <20030803194011.GA8324@work.bitmover.com> <20030803203051.GA9057@work.bitmover.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4549 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: miller@techsource.com Precedence: bulk X-list: netdev Larry McVoy wrote: > On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote: > >>2. router nodes that have access to main memory (PCI card running linux >>acting as a router/firewall/VPN to offload the main CPU's) > > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? Physical space? Power usage? Heat dissipation? Optimization for the specific task? Fast, low latency communication between CPU and device (ie. local bus)? Maintenance? Lots of reasons why one might pay more for the offload card. If you're cheap, you'll just use the software stack and a $10 NIC and just live with the corresponding CPU usage. If you're a performance freak, you'll spend whatever you have to to squeeze out every last bit of performance you can. Mind you, another option is, if you're dealing with the kind of load that requires that much network performance, is to use redundant servers, like google. No one server is exceptionally fast, but it not many people are using it, it's fast enough. From shemminger@osdl.org Tue Aug 5 14:46:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:46:44 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LkbFl005904 for ; Tue, 5 Aug 2003 14:46:37 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LkNI01529; Tue, 5 Aug 2003 14:46:23 -0700 Date: Tue, 5 Aug 2003 14:46:22 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S. Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] (2/2) Convert ROSE to seq_file Message-Id: <20030805144622.100f208d.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4551 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev The existing ROSE /proc interface has no module owner, and doesn't check for bounds overflow. Easier to just convert it to the seq_file wrapper functions. This patch is against 2.6.0-test2 (offsets assume earlier patch). diff -Nru a/include/net/rose.h b/include/net/rose.h --- a/include/net/rose.h Tue Aug 5 14:36:07 2003 +++ b/include/net/rose.h Tue Aug 5 14:36:07 2003 @@ -140,6 +140,9 @@ #define rose_sk(__sk) ((rose_cb *)(__sk)->sk_protinfo) +/* Magic value indicating first entry in /proc (ie header) */ +#define ROSE_PROC_START ((void *) 1) + /* af_rose.c */ extern ax25_address rose_callsign; extern int sysctl_rose_restart_request_timeout; @@ -154,7 +157,7 @@ extern int sysctl_rose_window_size; extern int rosecmp(rose_address *, rose_address *); extern int rosecmpm(rose_address *, rose_address *, unsigned short); -extern char *rose2asc(rose_address *); +extern const char *rose2asc(const rose_address *); extern struct sock *rose_find_socket(unsigned int, struct rose_neigh *); extern void rose_kill_by_neigh(struct rose_neigh *); extern unsigned int rose_new_lci(struct rose_neigh *); @@ -193,6 +196,9 @@ /* rose_route.c */ extern struct rose_neigh *rose_loopback_neigh; +extern struct file_operations rose_neigh_fops; +extern struct file_operations rose_nodes_fops; +extern struct file_operations rose_routes_fops; extern int rose_add_loopback_neigh(void); extern int rose_add_loopback_node(rose_address *); @@ -207,9 +213,6 @@ extern int rose_rt_ioctl(unsigned int, void *); extern void rose_link_failed(ax25_cb *, int); extern int rose_route_frame(struct sk_buff *, ax25_cb *); -extern int rose_nodes_get_info(char *, char **, off_t, int); -extern int rose_neigh_get_info(char *, char **, off_t, int); -extern int rose_routes_get_info(char *, char **, off_t, int); extern void rose_rt_free(void); /* rose_subr.c */ diff -Nru a/net/rose/af_rose.c b/net/rose/af_rose.c --- a/net/rose/af_rose.c Tue Aug 5 14:36:07 2003 +++ b/net/rose/af_rose.c Tue Aug 5 14:36:07 2003 @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -56,8 +57,8 @@ int sysctl_rose_maximum_vcs = ROSE_DEFAULT_MAXVC; int sysctl_rose_window_size = ROSE_DEFAULT_WINDOW_SIZE; -static HLIST_HEAD(rose_list); -static spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; +HLIST_HEAD(rose_list); +spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; static struct proto_ops rose_proto_ops; @@ -66,7 +67,7 @@ /* * Convert a ROSE address into text. */ -char *rose2asc(rose_address *addr) +const char *rose2asc(const rose_address *addr) { static char buffer[11]; @@ -1332,29 +1333,57 @@ return 0; } -static int rose_get_info(char *buffer, char **start, off_t offset, int length) +#ifdef CONFIG_PROC_FS +static void *rose_info_start(struct seq_file *seq, loff_t *pos) { + int i; struct sock *s; struct hlist_node *node; - struct net_device *dev; - const char *devname, *callsign; - int len = 0; - off_t pos = 0; - off_t begin = 0; spin_lock_bh(&rose_list_lock); + if (*pos == 0) + return ROSE_PROC_START; + + i = 1; + sk_for_each(s, node, &rose_list) { + if (i == *pos) + return s; + ++i; + } + return NULL; +} - len += sprintf(buffer, "dest_addr dest_call src_addr src_call dev lci neigh st vs vr va t t1 t2 t3 hb idle Snd-Q Rcv-Q inode\n"); +static void *rose_info_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; - sk_for_each(s, node, &rose_list) { + return (v == ROSE_PROC_START) ? sk_head(&rose_list) + : sk_next((struct sock *)v); +} + +static void rose_info_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_list_lock); +} + +static int rose_info_show(struct seq_file *seq, void *v) +{ + if (v == ROSE_PROC_START) + seq_puts(seq, + "dest_addr dest_call src_addr src_call dev lci neigh st vs vr va t t1 t2 t3 hb idle Snd-Q Rcv-Q inode\n"); + + else { + struct sock *s = v; rose_cb *rose = rose_sk(s); + const char *devname, *callsign; + const struct net_device *dev = rose->device; - if ((dev = rose->device) == NULL) + if (!dev) devname = "???"; else devname = dev->name; - - len += sprintf(buffer + len, "%-10s %-9s ", + + seq_printf(seq, "%-10s %-9s ", rose2asc(&rose->dest_addr), ax2asc(&rose->dest_call)); @@ -1363,7 +1392,8 @@ else callsign = ax2asc(&rose->source_call); - len += sprintf(buffer + len, "%-10s %-9s %-5s %3.3X %05d %d %d %d %d %3lu %3lu %3lu %3lu %3lu %3lu/%03lu %5d %5d %ld\n", + seq_printf(seq, + "%-10s %-9s %-5s %3.3X %05d %d %d %d %d %3lu %3lu %3lu %3lu %3lu %3lu/%03lu %5d %5d %ld\n", rose2asc(&rose->source_addr), callsign, devname, @@ -1383,27 +1413,32 @@ atomic_read(&s->sk_wmem_alloc), atomic_read(&s->sk_rmem_alloc), s->sk_socket ? SOCK_INODE(s->sk_socket)->i_ino : 0L); - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; - } - - if (pos > offset + length) - break; } - spin_unlock_bh(&rose_list_lock); - *start = buffer + (offset - begin); - len -= (offset - begin); + return 0; +} - if (len > length) len = length; +static struct seq_operations rose_info_seqops = { + .start = rose_info_start, + .next = rose_info_next, + .stop = rose_info_stop, + .show = rose_info_show, +}; - return len; +static int rose_info_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_info_seqops); } +static struct file_operations rose_info_fops = { + .owner = THIS_MODULE, + .open = rose_info_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; +#endif /* CONFIG_PROC_FS */ + static struct net_proto_family rose_family_ops = { .family = PF_ROSE, .create = rose_create, @@ -1499,10 +1534,11 @@ rose_add_loopback_neigh(); - proc_net_create("rose", 0, rose_get_info); - proc_net_create("rose_neigh", 0, rose_neigh_get_info); - proc_net_create("rose_nodes", 0, rose_nodes_get_info); - proc_net_create("rose_routes", 0, rose_routes_get_info); + proc_net_fops_create("rose", S_IRUGO, &rose_info_fops); + proc_net_fops_create("rose_neigh", S_IRUGO, &rose_neigh_fops); + proc_net_fops_create("rose_nodes", S_IRUGO, &rose_nodes_fops); + proc_net_fops_create("rose_routes", S_IRUGO, &rose_routes_fops); + return 0; } module_init(rose_proto_init); diff -Nru a/net/rose/rose_route.c b/net/rose/rose_route.c --- a/net/rose/rose_route.c Tue Aug 5 14:36:07 2003 +++ b/net/rose/rose_route.c Tue Aug 5 14:36:07 2003 @@ -35,12 +35,13 @@ #include #include #include +#include static unsigned int rose_neigh_no = 1; static struct rose_node *rose_node_list; static spinlock_t rose_node_list_lock = SPIN_LOCK_UNLOCKED; -static struct rose_neigh *rose_neigh_list; +struct rose_neigh *rose_neigh_list; static spinlock_t rose_neigh_list_lock = SPIN_LOCK_UNLOCKED; static struct rose_route *rose_route_list; static spinlock_t rose_route_list_lock = SPIN_LOCK_UNLOCKED; @@ -1066,165 +1067,248 @@ return res; } -int rose_nodes_get_info(char *buffer, char **start, off_t offset, int length) +#ifdef CONFIG_PROC_FS + +static void *rose_node_start(struct seq_file *seq, loff_t *pos) { struct rose_node *rose_node; - int len = 0; - off_t pos = 0; - off_t begin = 0; - int i; + int i = 1; spin_lock_bh(&rose_neigh_list_lock); + if (*pos == 0) + return ROSE_PROC_START; + + for (rose_node = rose_node_list; rose_node && i < *pos; + rose_node = rose_node->next, ++i); + + return (i == *pos) ? rose_node : NULL; +} - len += sprintf(buffer, "address mask n neigh neigh neigh\n"); +static void *rose_node_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? rose_node_list + : ((struct rose_node *)v)->next; +} - for (rose_node = rose_node_list; rose_node != NULL; rose_node = rose_node->next) { +static void rose_node_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_neigh_list_lock); +} + +static int rose_node_show(struct seq_file *seq, void *v) +{ + int i; + + if (v == ROSE_PROC_START) + seq_puts(seq, "address mask n neigh neigh neigh\n"); + else { + const struct rose_node *rose_node = v; /* if (rose_node->loopback) { - len += sprintf(buffer + len, "%-10s %04d 1 loopback\n", + seq_printf(seq, "%-10s %04d 1 loopback\n", rose2asc(&rose_node->address), rose_node->mask); } else { */ - len += sprintf(buffer + len, "%-10s %04d %d", + seq_printf(seq, "%-10s %04d %d", rose2asc(&rose_node->address), rose_node->mask, rose_node->count); for (i = 0; i < rose_node->count; i++) - len += sprintf(buffer + len, " %05d", + seq_printf(seq, " %05d", rose_node->neighbour[i]->number); - len += sprintf(buffer + len, "\n"); + seq_puts(seq, "\n"); /* } */ + } + return 0; +} - pos = begin + len; +static struct seq_operations rose_node_seqops = { + .start = rose_node_start, + .next = rose_node_next, + .stop = rose_node_stop, + .show = rose_node_show, +}; + +static int rose_nodes_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_node_seqops); +} + +struct file_operations rose_nodes_fops = { + .owner = THIS_MODULE, + .open = rose_nodes_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; - if (pos < offset) { - len = 0; - begin = pos; - } +static void *rose_neigh_start(struct seq_file *seq, loff_t *pos) +{ + struct rose_neigh *rose_neigh; + int i = 1; - if (pos > offset + length) - break; - } - spin_unlock_bh(&rose_neigh_list_lock); + spin_lock_bh(&rose_neigh_list_lock); + if (*pos == 0) + return ROSE_PROC_START; - *start = buffer + (offset - begin); - len -= (offset - begin); + for (rose_neigh = rose_neigh_list; rose_neigh && i < *pos; + rose_neigh = rose_neigh->next, ++i); - if (len > length) - len = length; + return (i == *pos) ? rose_neigh : NULL; +} - return len; +static void *rose_neigh_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? rose_neigh_list + : ((struct rose_neigh *)v)->next; } -int rose_neigh_get_info(char *buffer, char **start, off_t offset, int length) +static void rose_neigh_stop(struct seq_file *seq, void *v) { - struct rose_neigh *rose_neigh; - int len = 0; - off_t pos = 0; - off_t begin = 0; - int i; + spin_unlock_bh(&rose_neigh_list_lock); +} - spin_lock_bh(&rose_neigh_list_lock); +static int rose_neigh_show(struct seq_file *seq, void *v) +{ + int i; - len += sprintf(buffer, "addr callsign dev count use mode restart t0 tf digipeaters\n"); + if (v == ROSE_PROC_START) + seq_puts(seq, + "addr callsign dev count use mode restart t0 tf digipeaters\n"); + else { + struct rose_neigh *rose_neigh = v; - for (rose_neigh = rose_neigh_list; rose_neigh != NULL; rose_neigh = rose_neigh->next) { /* if (!rose_neigh->loopback) { */ - len += sprintf(buffer + len, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu", - rose_neigh->number, - (rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(&rose_neigh->callsign), - rose_neigh->dev ? rose_neigh->dev->name : "???", - rose_neigh->count, - rose_neigh->use, - (rose_neigh->dce_mode) ? "DCE" : "DTE", - (rose_neigh->restarted) ? "yes" : "no", - ax25_display_timer(&rose_neigh->t0timer) / HZ, - ax25_display_timer(&rose_neigh->ftimer) / HZ); - - if (rose_neigh->digipeat != NULL) { - for (i = 0; i < rose_neigh->digipeat->ndigi; i++) - len += sprintf(buffer + len, " %s", ax2asc(&rose_neigh->digipeat->calls[i])); - } - - len += sprintf(buffer + len, "\n"); - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; - } + seq_printf(seq, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu", + rose_neigh->number, + (rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(&rose_neigh->callsign), + rose_neigh->dev ? rose_neigh->dev->name : "???", + rose_neigh->count, + rose_neigh->use, + (rose_neigh->dce_mode) ? "DCE" : "DTE", + (rose_neigh->restarted) ? "yes" : "no", + ax25_display_timer(&rose_neigh->t0timer) / HZ, + ax25_display_timer(&rose_neigh->ftimer) / HZ); + + if (rose_neigh->digipeat != NULL) { + for (i = 0; i < rose_neigh->digipeat->ndigi; i++) + seq_printf(seq, " %s", ax2asc(&rose_neigh->digipeat->calls[i])); + } - if (pos > offset + length) - break; - /* } */ + seq_puts(seq, "\n"); } + return 0; +} - spin_unlock_bh(&rose_neigh_list_lock); - - *start = buffer + (offset - begin); - len -= (offset - begin); - if (len > length) - len = length; +static struct seq_operations rose_neigh_seqops = { + .start = rose_neigh_start, + .next = rose_neigh_next, + .stop = rose_neigh_stop, + .show = rose_neigh_show, +}; - return len; +static int rose_neigh_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_neigh_seqops); } -int rose_routes_get_info(char *buffer, char **start, off_t offset, int length) +struct file_operations rose_neigh_fops = { + .owner = THIS_MODULE, + .open = rose_neigh_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + + +static void *rose_route_start(struct seq_file *seq, loff_t *pos) { struct rose_route *rose_route; - int len = 0; - off_t pos = 0; - off_t begin = 0; + int i = 1; spin_lock_bh(&rose_route_list_lock); + if (*pos == 0) + return ROSE_PROC_START; - len += sprintf(buffer, "lci address callsign neigh <-> lci address callsign neigh\n"); + for (rose_route = rose_route_list; rose_route && i < *pos; + rose_route = rose_route->next, ++i); - for (rose_route = rose_route_list; rose_route != NULL; rose_route = rose_route->next) { - if (rose_route->neigh1 != NULL) { - len += sprintf(buffer + len, "%3.3X %-10s %-9s %05d ", - rose_route->lci1, - rose2asc(&rose_route->src_addr), - ax2asc(&rose_route->src_call), - rose_route->neigh1->number); - } else { - len += sprintf(buffer + len, "000 * * 00000 "); - } + return (i == *pos) ? rose_route : NULL; +} + +static void *rose_route_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? rose_route_list + : ((struct rose_route *)v)->next; +} - if (rose_route->neigh2 != NULL) { - len += sprintf(buffer + len, "%3.3X %-10s %-9s %05d\n", +static void rose_route_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_route_list_lock); +} + +static int rose_route_show(struct seq_file *seq, void *v) +{ + if (v == ROSE_PROC_START) + seq_puts(seq, + "lci address callsign neigh <-> lci address callsign neigh\n"); + else { + struct rose_route *rose_route = v; + + if (rose_route->neigh1) + seq_printf(seq, + "%3.3X %-10s %-9s %05d ", + rose_route->lci1, + rose2asc(&rose_route->src_addr), + ax2asc(&rose_route->src_call), + rose_route->neigh1->number); + else + seq_puts(seq, + "000 * * 00000 "); + + if (rose_route->neigh2) + seq_printf(seq, + "%3.3X %-10s %-9s %05d\n", rose_route->lci2, rose2asc(&rose_route->dest_addr), ax2asc(&rose_route->dest_call), rose_route->neigh2->number); - } else { - len += sprintf(buffer + len, "000 * * 00000\n"); - } - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; + else + seq_puts(seq, + "000 * * 00000\n"); } + return 0; +} - if (pos > offset + length) - break; - } - - spin_unlock_bh(&rose_route_list_lock); - - *start = buffer + (offset - begin); - len -= (offset - begin); - - if (len > length) - len = length; +static struct seq_operations rose_route_seqops = { + .start = rose_route_start, + .next = rose_route_next, + .stop = rose_route_stop, + .show = rose_route_show, +}; + +static int rose_route_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_route_seqops); +} + +struct file_operations rose_routes_fops = { + .owner = THIS_MODULE, + .open = rose_route_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; - return len; -} +#endif /* CONFIG_PROC_FS */ /* * Release all memory associated with ROSE routing structures. From shemminger@osdl.org Tue Aug 5 14:46:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:46:44 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LkYFl005901 for ; Tue, 5 Aug 2003 14:46:35 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LkHI01525; Tue, 5 Aug 2003 14:46:18 -0700 Date: Tue, 5 Aug 2003 14:46:17 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S. Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH 2.6.0-test2] (1/2) Dynamically allocate net_device structures for ROSE Message-Id: <20030805144617.2e856d6d.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4550 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev This patch changes the ROSE protocol to allocate an array of pointers and each network device separately. This sets up later change where network_device object's are released on last use which may be after the module is unloaded. The patch is against 2.6.0-test2 (though this code hasn't changed in a long time). Allocation is done via alloc_netdev so the dev->priv area is already reserved and doesn't need to be allocated separately. diff -Nru a/include/net/rose.h b/include/net/rose.h --- a/include/net/rose.h Tue Aug 5 14:35:52 2003 +++ b/include/net/rose.h Tue Aug 5 14:35:52 2003 @@ -163,7 +163,7 @@ /* rose_dev.c */ extern int rose_rx_ip(struct sk_buff *, struct net_device *); -extern int rose_init(struct net_device *); +extern void rose_setup(struct net_device *); /* rose_in.c */ extern int rose_process_rx_frame(struct sock *, struct sk_buff *); diff -Nru a/net/rose/af_rose.c b/net/rose/af_rose.c --- a/net/rose/af_rose.c Tue Aug 5 14:35:52 2003 +++ b/net/rose/af_rose.c Tue Aug 5 14:35:52 2003 @@ -43,7 +43,7 @@ #include #include -int rose_ndevs = 10; +static int rose_ndevs = 10; int sysctl_rose_restart_request_timeout = ROSE_DEFAULT_T0; int sysctl_rose_call_request_timeout = ROSE_DEFAULT_T1; @@ -56,7 +56,7 @@ int sysctl_rose_maximum_vcs = ROSE_DEFAULT_MAXVC; int sysctl_rose_window_size = ROSE_DEFAULT_WINDOW_SIZE; -HLIST_HEAD(rose_list); +static HLIST_HEAD(rose_list); static spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; static struct proto_ops rose_proto_ops; @@ -1435,7 +1435,7 @@ .notifier_call = rose_device_event, }; -static struct net_device *dev_rose; +static struct net_device **dev_rose; static const char banner[] = KERN_INFO "F6FBB/G4KLX ROSE for Linux. Version 0.62 for AX25.037 Linux 2.4\n"; @@ -1450,17 +1450,39 @@ return -1; } - if ((dev_rose = kmalloc(rose_ndevs * sizeof(struct net_device), GFP_KERNEL)) == NULL) { + dev_rose = kmalloc(rose_ndevs * sizeof(struct net_device *), GFP_KERNEL); + if (dev_rose == NULL) { printk(KERN_ERR "ROSE: rose_proto_init - unable to allocate device structure\n"); return -1; } - memset(dev_rose, 0x00, rose_ndevs * sizeof(struct net_device)); + memset(dev_rose, 0x00, rose_ndevs * sizeof(struct net_device*)); + for (i = 0; i < rose_ndevs; i++) { + struct net_device *dev; + char name[IFNAMSIZ]; + + sprintf(name, "rose%d", i); + dev = alloc_netdev(sizeof(struct net_device_stats), + name, rose_setup); + if (!dev) { + printk(KERN_ERR "ROSE: rose_proto_init - unable to allocate memory\n"); + while (--i >= 0) + kfree(dev_rose[i]); + return -ENOMEM; + } + dev_rose[i] = dev; + } for (i = 0; i < rose_ndevs; i++) { - sprintf(dev_rose[i].name, "rose%d", i); - dev_rose[i].init = rose_init; - register_netdev(&dev_rose[i]); + if (register_netdev(dev_rose[i])) { + printk(KERN_ERR "ROSE: netdevice regeistration failed\n"); + while (--i >= 0) { + unregister_netdev(dev_rose[i]); + kfree(dev_rose[i]); + return -EIO; + } + } + } sock_register(&rose_family_ops); @@ -1518,10 +1540,11 @@ sock_unregister(PF_ROSE); for (i = 0; i < rose_ndevs; i++) { - if (dev_rose[i].priv != NULL) { - kfree(dev_rose[i].priv); - dev_rose[i].priv = NULL; - unregister_netdev(&dev_rose[i]); + struct net_device *dev = dev_rose[i]; + + if (dev) { + unregister_netdev(dev); + kfree(dev); } } diff -Nru a/net/rose/rose_dev.c b/net/rose/rose_dev.c --- a/net/rose/rose_dev.c Tue Aug 5 14:35:52 2003 +++ b/net/rose/rose_dev.c Tue Aug 5 14:35:52 2003 @@ -165,7 +165,7 @@ return (struct net_device_stats *)dev->priv; } -int rose_init(struct net_device *dev) +void rose_setup(struct net_device *dev) { SET_MODULE_OWNER(dev); dev->mtu = ROSE_MAX_PACKET_SIZE - 2; @@ -182,13 +182,5 @@ /* New-style flags. */ dev->flags = 0; - - if ((dev->priv = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL)) == NULL) - return -ENOMEM; - - memset(dev->priv, 0, sizeof(struct net_device_stats)); - dev->get_stats = rose_get_stats; - - return 0; -}; +} From shemminger@osdl.org Tue Aug 5 14:57:16 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:57:23 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LvFFl007116 for ; Tue, 5 Aug 2003 14:57:16 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LuwI03721; Tue, 5 Aug 2003 14:56:58 -0700 Date: Tue, 5 Aug 2003 14:56:58 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S. Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] Fix use after free in AX.25 Message-Id: <20030805145658.1b3f194b.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4552 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev This patch is against 2.6.0-test2. The problem is that the ax25_destroy_socket function frees the socket buffer, but then ax25_release dereferences this causing an OOPS. To reproduce: modprobe ax25; ifconfig -a Replaced sk_free with sock_put which will free if this is the last reference. diff -urNp -X dontdiff net-2.5/net/ax25/af_ax25.c linux-2.5-net/net/ax25/af_ax25.c --- net-2.5/net/ax25/af_ax25.c 2003-08-04 09:32:21.000000000 -0700 +++ linux-2.5-net/net/ax25/af_ax25.c 2003-08-05 14:34:21.000000000 -0700 @@ -349,7 +349,7 @@ void ax25_destroy_socket(ax25_cb *ax25) ax25->timer.data = (unsigned long)ax25; add_timer(&ax25->timer); } else { - sk_free(ax25->sk); + sock_put(ax25->sk); } } else { ax25_free_cb(ax25); @@ -944,15 +944,13 @@ static int ax25_release(struct socket *s switch (ax25->state) { case AX25_STATE_0: ax25_disconnect(ax25, 0); - ax25_destroy_socket(ax25); - break; + goto drop; case AX25_STATE_1: case AX25_STATE_2: ax25_send_control(ax25, AX25_DISC, AX25_POLLON, AX25_COMMAND); ax25_disconnect(ax25, 0); - ax25_destroy_socket(ax25); - break; + goto drop; case AX25_STATE_3: case AX25_STATE_4: @@ -995,13 +993,16 @@ static int ax25_release(struct socket *s sk->sk_shutdown |= SEND_SHUTDOWN; sk->sk_state_change(sk); sock_set_flag(sk, SOCK_DEAD); - ax25_destroy_socket(ax25); + goto drop; } sock->sk = NULL; sk->sk_socket = NULL; /* Not used, but we should do this */ release_sock(sk); - + return 0; + drop: + release_sock(sk); + ax25_destroy_socket(ax25); return 0; } From shemminger@osdl.org Tue Aug 5 15:01:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:01:30 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75M1QFl007706 for ; Tue, 5 Aug 2003 15:01:26 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75M1AI04456; Tue, 5 Aug 2003 15:01:10 -0700 Date: Tue, 5 Aug 2003 15:01:10 -0700 From: Stephen Hemminger To: Henner Eisen , "David S. Miller" Cc: linux-x25@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] Fix X.25 use after free. Message-Id: <20030805150110.0e2753ab.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4553 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev The conversion from cli/sti to locking in X.25 must not have been tested on a real SMP with memory debugging enabled. It OOPS right away if I do: modprobe x25; ifconfig -a The problem is that it dereferences the socket after it has already been freed. The fix for this is to make the call to sock_put, later in x25_destroy_socket do the free. Also, need a go to avoid references in x25_release. This patch is against 2.6.0-test2. diff -urNp -X dontdiff net-2.5/net/x25/af_x25.c linux-2.5-net/net/x25/af_x25.c --- net-2.5/net/x25/af_x25.c 2003-08-01 11:12:02.000000000 -0700 +++ linux-2.5-net/net/x25/af_x25.c 2003-08-05 12:14:42.000000000 -0700 @@ -350,8 +350,11 @@ void x25_destroy_socket(struct sock *sk) sk->sk_timer.function = x25_destroy_timer; sk->sk_timer.data = (unsigned long)sk; add_timer(&sk->sk_timer); - } else - sk_free(sk); + } else { + /* drop last reference so sock_put will free */ + __sock_put(sk); + } + release_sock(sk); sock_put(sk); } @@ -553,7 +556,7 @@ static int x25_release(struct socket *so case X25_STATE_2: x25_disconnect(sk, 0, 0, 0); x25_destroy_socket(sk); - break; + goto out; case X25_STATE_1: case X25_STATE_3: From felix@allot.com Tue Aug 5 15:14:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:14:49 -0700 (PDT) Received: from mxout1.netvision.net.il (mxout1.netvision.net.il [194.90.9.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MEiFl008833 for ; Tue, 5 Aug 2003 15:14:45 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout1.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ5001RIL0W5P@mxout1.netvision.net.il> for netdev@oss.sgi.com; Tue, 05 Aug 2003 18:43:44 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QG1CBDM4; Tue, 05 Aug 2003 18:46:45 +0200 Date: Tue, 05 Aug 2003 18:44:10 +0300 From: Felix Radensky Subject: Re: e100 "Ferguson" release To: "Feldman, Scott" Cc: Ben Greear , Jeff Garzik , netdev@oss.sgi.com Message-id: <3F2FD0CA.1080403@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: X-archive-position: 4554 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev --Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT Hi, Scott This change seems to fix the problem. Thanks a lot ! Felix. Feldman, Scott wrote: >>I've also noticed that the number of hard_start_xmit failures >>in e1000 has increased significantly in version 5.1.13-k1. In >>version 5.0.43-k1 the number of failures was much smaller. >> >> > >Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and >see what happens? With the change below, 5.1.13 would be more >aggressive on Tx cleanup, so we'll be quicker waking the queue than >before. > >-scott > > for(i = 0; i < E1000_MAX_INTR; i++) >- if(!e1000_clean_rx_irq(adapter) && >+ if(!e1000_clean_rx_irq(adapter) & > !e1000_clean_tx_irq(adapter)) > break; > >[1] Something still bothers me about this new form where we're mixing a >bit-wise operator with logical operands. Should this bother me? > > > --Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw) Content-type: text/html; charset=us-ascii Content-transfer-encoding: 7BIT Hi, Scott

This change seems to fix the problem.
Thanks a lot !

Felix.

Feldman, Scott wrote:
I've also noticed that the number of hard_start_xmit failures 
in e1000 has increased significantly in version 5.1.13-k1. In 
version 5.0.43-k1 the number of failures was much smaller.
    

Interesting.  Felix, would you undo the change[1] below in 5.1.13-k1 and
see what happens?  With the change below, 5.1.13 would be more
aggressive on Tx cleanup, so we'll be quicker waking the queue than
before. 

-scott

        for(i = 0; i < E1000_MAX_INTR; i++)
-               if(!e1000_clean_rx_irq(adapter) &&
+               if(!e1000_clean_rx_irq(adapter) &
                   !e1000_clean_tx_irq(adapter))
                        break;

[1] Something still bothers me about this new form where we're mixing a
bit-wise operator with logical operands.  Should this bother me?

  

--Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw)-- From nf@hipac.org Tue Aug 5 15:23:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:23:34 -0700 (PDT) Received: from smtprelay02.ispgateway.de (smtprelay02.ispgateway.de [62.67.200.157]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MNNFl009687 for ; Tue, 5 Aug 2003 15:23:24 -0700 Received: (qmail 18331 invoked from network); 5 Aug 2003 22:23:20 -0000 Received: from unknown (HELO portal.lan) (134300@[80.138.239.123]) (envelope-sender ) by smtprelay02.ispgateway.de (qmail-ldap-1.03) with SMTP for ; 5 Aug 2003 22:23:20 -0000 Received: from hipac.org (tmobile.lan [192.168.0.6]) by portal.lan (Postfix) with ESMTP id 235E14B0B6; Tue, 5 Aug 2003 22:46:13 +0200 (CEST) Message-ID: <3F302E04.1090503@hipac.org> Date: Wed, 06 Aug 2003 00:21:56 +0200 From: Michael Bellion and Thomas Heinz User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030714 Debian/1.4-2 X-Accept-Language: de, en MIME-Version: 1.0 To: hadi@cyberus.ca Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [RFC] High Performance Packet Classifiction for tc framework References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> <1060012260.1103.380.camel@jzny.localdomain> In-Reply-To: <1060012260.1103.380.camel@jzny.localdomain> X-Enigmail-Version: 0.76.2.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig2AA484285077C06548045724" X-archive-position: 4555 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nf@hipac.org Precedence: bulk X-list: netdev This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig2AA484285077C06548045724 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Hi Jamal You wrote: > I promise i will. I dont think i will do it justice spending 5 minutes > on it. I take it you have written extensive docs too ;-> Of course ;-) Well, actually we are going to present an overview of the hipac algorithm at the netfilter developer workshop in Budapest. Hope to see you there. > Unfortunately it is more exciting to write code than documents. I almost > got someone to document at least its proper usage but they backed away > at the last minute. lol > I dont wanna go in a lot of details, but one important detail is that > keynodes can also lead to other hash tables. So you can split the packet > parsing across multiple hashes - this is where the comparison with > chains comes in. There are several ways to do this. I'll show you the > brute force way and you can make it more usable with "hashkey" and > "sample" operator. Stealing from your example: > > [example snipped] > > Makes sense? Yes, it does. Still the question is how to solve this generally. Consider the following example ruleset: 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30 So you have 1 src ip hash and #buckets(src ip hash) many dst ip hashes. In order to achieve maximum performance you have to minimize the number of collisions in the hash buckets. How would you choose the hash function and what would the construction look like? In principle the tree of hashes approach is capable to express a general access list like ruleset, i.e. a set of terminal rules with different priorities. The problem is that the approach is only efficient if the number of collisions is O(1) -> no amortized analysis but rather per bucket. In theory you can do the following. Let's consider one dimension. The matches in one dimension form a set of elementary intervals which are overlapped by certain rules. Example: |------| |---------| |----------------| |------------------| |---------------| |----|---|--|---|-----|---|----|-------|--|------|-------| The '|-----|' reflect the matches and the bottom line represents the set of elementary intervals introduced by the matches. Now, you can decide for each elementary interval which rule matches since the rules are terminal and uniquely prioritized. The next step would be to create a hash with #elementary intervals many buckets and create a hash function which maps the keys to the appropriate buckets like in the picture. In this case you have exactly 1 entry per hash bucket. Sounds fine BUT it is not possible to generically deduce an easily (= fast) computable hash function with the described requirements. BTW, this approach can be extended to 2 or more dimensions where the hash function for each hash has to meet the requirement. Of course this information is not very helpful since the problem of defining appropriate hash functions remains ;) Obviously this way is not viable but supposedly the only one to achieve ultimate performance with the tree of hashes concept. BTW, the way hipac works is basically not so different from the idea described above but since we use efficient btrees we don't have to define hash functions. > sure position could be used as a priority. It is easier/intuitive to > just have explicit priorities. Merely a matter of taste. The way iptables and nf-hipac use priorities is somewhat more dynamic than the tc way because they are automatically adjusted if a rule is inserted in between others. > What "optimizes" could be a user interface or the thread i was talking > about earlier. Hm, this rebalancing is not clear to us. Do you want to rebalance the tree of hashes? This seems a little strange at the first glance because the performance of the tree of hashes is dominated by the number of collisions that need to be resolved and not the depth of the tree. > Is your plan to put this in other places other than Linux? Currently we are working on the integration in linux. In general the hipac core is OS and application independent, so basically it could also be used for some userspace program which is related to classification and of course in other OS's. Any special reason why you are asking this question? > So you got this thought from iptables and took it to the next level? Well, in order to support iptables matches and targets we had to create an appropriate abstraction for them on the hipac layer. This abstraction can also be used for tc classifiers if the tcf_result is ignored, i.e. you just consider whether the filter matched or not. > I am still not sure i understand why not use what already exists - but > i'll just say i dont see it right now. If hipac had no support for embedded classifiers you couldn't express a ruleset like: 1) [native hipac matches] [u32 filter] [classid] 2) [native hipac matches] [classid] You would have to construct rule 1) in a way that it "jumps" to an external u32 filter. Unfortunately, you cannot jump back to the hipac filter again in case the u32 filter does not match so rule 2) is unreachable. This problem is caused by the fact that cls_hipac can occur at most once per interface. > It doesnt appear harmful to leave it there without destroying it. > The next time someome adds a filter of the same protocol + priority, it > will already exist. If you want to be accurate (because it does get > destroyed when the init() fails), then destroy it but you need to put > checks for "incase we have added a new tcf_proto" which may not look > pretty. Is this causing you some discomfort? No, actually not. Regards, +-----------------------+----------------------+ | Michael Bellion | Thomas Heinz | | | | +-----------------------+----------------------+ | High Performance Packet Classification | | nf-hipac: http://www.hipac.org/ | +----------------------------------------------+ --------------enig2AA484285077C06548045724 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Debian - http://enigmail.mozdev.org iD8DBQE/MC4FtXh2AYIMjggRAvm5AJ4r5t7eKXHNt/mWCIcS93+l/Gh+tgCdH82Z 76Nh+wx5v75reDsjfY1SJY4= =NW50 -----END PGP SIGNATURE----- --------------enig2AA484285077C06548045724-- From shemminger@osdl.org Tue Aug 5 15:43:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:44:05 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MhrFl011720 for ; Tue, 5 Aug 2003 15:43:54 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75MhZI14668; Tue, 5 Aug 2003 15:43:36 -0700 Date: Tue, 5 Aug 2003 15:43:35 -0700 From: Stephen Hemminger To: Henner Eisen , "David S. Miller" , linux-x25@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] X.25 async net_device fixup Message-Id: <20030805154335.7abfcb92.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4556 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Convert X.25 async driver to have dynamic net_device's. This driver is a lot like SLIP so the code changes were similar. - Added similar locking to SLIP - replaced code that snooped for MTU changes with the net_device change mtu callback. - cleaned up the statistics by using the net_device_stats structure. Patch is against 2.6.0-test2. Not sure if anyone ever uses this. I tested by bringing up an x.25 async line using a modified version of slattach. diff -urNp -X dontdiff net-2.5/drivers/net/wan/x25_asy.c linux-2.5-net/drivers/net/wan/x25_asy.c --- net-2.5/drivers/net/wan/x25_asy.c 2003-08-01 11:11:42.000000000 -0700 +++ linux-2.5-net/drivers/net/wan/x25_asy.c 2003-07-31 13:22:41.000000000 -0700 @@ -34,81 +34,67 @@ #include #include "x25_asy.h" -typedef struct x25_ctrl { - struct x25_asy ctrl; /* X.25 things */ - struct net_device dev; /* the device */ -} x25_asy_ctrl_t; - -static x25_asy_ctrl_t **x25_asy_ctrls = NULL; - -int x25_asy_maxdev = SL_NRUNIT; /* Can be overridden with insmod! */ +static struct net_device **x25_asy_devs; +static int x25_asy_maxdev = SL_NRUNIT; MODULE_PARM(x25_asy_maxdev, "i"); MODULE_LICENSE("GPL"); static int x25_asy_esc(unsigned char *p, unsigned char *d, int len); static void x25_asy_unesc(struct x25_asy *sl, unsigned char c); +static void x25_asy_setup(struct net_device *dev); /* Find a free X.25 channel, and link in this `tty' line. */ -static inline struct x25_asy *x25_asy_alloc(void) +static struct x25_asy *x25_asy_alloc(void) { - x25_asy_ctrl_t *slp = NULL; + struct net_device *dev = NULL; + struct x25_asy *sl; int i; - if (x25_asy_ctrls == NULL) + if (x25_asy_devs == NULL) return NULL; /* Master array missing ! */ - for (i = 0; i < x25_asy_maxdev; i++) - { - slp = x25_asy_ctrls[i]; + for (i = 0; i < x25_asy_maxdev; i++) { + dev = x25_asy_devs[i]; + /* Not allocated ? */ - if (slp == NULL) + if (dev == NULL) break; + + sl = dev->priv; /* Not in use ? */ - if (!test_and_set_bit(SLF_INUSE, &slp->ctrl.flags)) - break; + if (!test_and_set_bit(SLF_INUSE, &sl->flags)) + return sl; } - /* SLP is set.. */ + /* Sorry, too many, all slots in use */ if (i >= x25_asy_maxdev) return NULL; /* If no channels are available, allocate one */ - if (!slp && - (x25_asy_ctrls[i] = (x25_asy_ctrl_t *)kmalloc(sizeof(x25_asy_ctrl_t), - GFP_KERNEL)) != NULL) { - slp = x25_asy_ctrls[i]; - memset(slp, 0, sizeof(x25_asy_ctrl_t)); + if (!dev) { + char name[IFNAMSIZ]; + sprintf(name, "x25asy%d", i); + + dev = alloc_netdev(sizeof(struct x25_asy), + name, x25_asy_setup); + if (!dev) + return NULL; /* Initialize channel control data */ - set_bit(SLF_INUSE, &slp->ctrl.flags); - slp->ctrl.tty = NULL; - sprintf(slp->dev.name, "x25asy%d", i); - slp->dev.base_addr = i; - slp->dev.priv = (void*)&(slp->ctrl); - slp->dev.next = NULL; - slp->dev.init = x25_asy_init; - } - if (slp != NULL) - { + sl = dev->priv; + dev->base_addr = i; /* register device so that it can be ifconfig'ed */ - /* x25_asy_init() will be called as a side-effect */ - /* SIDE-EFFECT WARNING: x25_asy_init() CLEARS slp->ctrl ! */ - - if (register_netdev(&(slp->dev)) == 0) - { + if (register_netdev(dev) == 0) { /* (Re-)Set the INUSE bit. Very Important! */ - set_bit(SLF_INUSE, &slp->ctrl.flags); - slp->ctrl.dev = &(slp->dev); - slp->dev.priv = (void*)&(slp->ctrl); - return (&(slp->ctrl)); - } - else - { - clear_bit(SLF_INUSE,&(slp->ctrl.flags)); + set_bit(SLF_INUSE, &sl->flags);